1  INTRODUCTION


Abstract  Advances  in  computing  technology  at  the  desktop  level  promise  improved 
efficiency  for  performing  computationally-intensive  statistics.  In  this  report,  we 
demonstrate  that  low-cost,  widely  available  digital  signal  processing  chips  employed 
within  a  personal  computer  environment  improve  statistical  processing  speeds  by  two 
orders  of  magnitude  over  conventional  approaches. 

1.1  Background:  Statistical  Computing  Trends 

Recent  developments  in  statistical  computation  emphasize  the  use  of  computation-intensive 
nonparametric  techniques.  These  techniques  include  bootstrapping,  jackknifing,  nonparametric 
regression,  density  estimation,  and  other  similar  methods  [Eddy  1986a].  Other  recent  trends  include 
Bayesian  analysis  [Berger  1985]  and  exploratory  data  analysis  [Friedman  1987]. 

These  and  other  computation-intensive  statistical  problems  are  often  solved  on  mainframes 
or  supercomputers.  Although  these  installations  can  provide  the  necessary  processing  speed,  they 
suffer  from  restricted  or  delayed  access,  lack  of  security,  lack  of  direct  interactive  system  control,  and 
often  poor  technical  assistance.  There  are  also  problems  with  reliable  data  transmission  and 
displaying  of  the  results.  These  factors  account  for  the  user  dissatisfaction  reported  in  [Goldberg 
1988,  Rushinek  1986]. 

To alleviate the computational burden on expensive, larger computers and to provide
convenience, the last decade has seen an increase in the use of personal computers for many
numerical tasks. Algorithms and programs that had principally been the domain of costly mainframe
computers have been routinely transferred to the more affordable personal computers. The reasons
for using mainframes in the first place (memory, speed, precision) have decreased in importance as
the performance of personal computers has improved by orders of magnitude. However, in spite of
these advances, the personal computer is still not optimal for certain tasks.

For example, a 10 MHz 286-type personal computer (PC) running Turbo C compiled code
will take 5 minutes to multiply two 100x100 matrices (~2x10^6 floating point operations). Adding a
numeric coprocessor will decrease this time to 1 minute. An enhanced PC with a higher clock speed,
a more efficient microprocessor, and optimized code reduces this by an order of magnitude. In
comparison, a first generation supercomputer (CRAY-1) will need 0.015 seconds to perform the same
operation [Klinger 1982], while the recent (CRAY-2) and future (CRAY-3) supercomputers will
reduce the time by an order of magnitude or more [Erisman 1988].
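The operation count quoted above can be checked with a short sketch. The routine below is a plain triple-loop matrix product, not the code that was benchmarked; the function name and the flop counter are illustrative assumptions. Each inner-loop step performs one multiplication and one addition, so two n x n matrices require 2n^3 operations, or 2x10^6 for n = 100.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply two n x n row-major matrices, c = a * b, and return the number
   of floating point operations performed (one multiply and one add per
   inner-loop step). Illustrative only; not the benchmarked code. */
long matmul_count_flops(const float *a, const float *b, float *c, size_t n)
{
    long flops = 0;
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; ++k) {
                sum += a[i * n + k] * b[k * n + j];
                flops += 2;
            }
            c[i * n + j] = sum;
        }
    }
    return flops;
}
```

Calling the routine with n = 100 returns 2,000,000, which is the figure used for the timing comparison.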

Since computation-intensive statistical procedures often require many iterations, it is apparent
that conventional personal computers (which often take days to solve these problems) are impractical.
Thus, there is a growing need to provide interactive, graphics-oriented, high-performance, and low-cost
statistical computing power in a microcomputer-based environment.

The goal will be to develop a flexible, interactive, high-speed, and low-cost statistics
workstation (SW) capable of solving a wide range of complex statistical problems. Our investigation
offers a potential solution to this problem through the use of digital signal processing (DSP) chips
acting as peripheral computation engines to the main PC processor¹.

1.2  PC-host,  DSP-based  Statistics  Workstation 

The fastest growing use of statistical analysis tools is within the single-user, PC-type
environment. A particular PC application may consist of a single complex computation on a large
data set. Another example may be a computation involving a resampling technique, such as
bootstrapping, on a small data set. Other applications may involve a complicated factorial analysis
or real-time data analysis. In general, these types of statistical problems routinely involve multivariate
sets of data that are broken down into arrays and matrices for further processing.

The statistical analysis techniques that use data in matrix and array form are often ideal
candidates for advanced array and vector processing methods. Unfortunately, the typical PC
architecture is not well suited for floating point computation on arrays. Therefore, improvements in
array and vector processing speed could potentially be achieved if more sophisticated tools were
interfaced to a PC. Architectures such as vector, parallel, or systolic processors are effective, but
are presently available only in supercomputers or in specialized parallel computers [Petersen 1983, Eddy
1986e]. These also require advanced software compilation techniques, such as the unrolling of loops
(which adds to the memory requirements), to make them effective [Grier 1988].

As an alternative, a DSP is a special purpose microprocessor optimized for floating point
processing of data arrays. Due to the DSP’s potential for array processing, we investigated
computation-intensive statistics as a prime application of a high-speed single-user workstation which
uses these devices to do the bulk of the computation. Since they are low-cost commercial devices, DSP’s
are particularly cost-effective for the proposed application. These processors were introduced in the
last decade to perform filtering algorithms in real time for a wide variety of applications [HPS 1990].

The DSP architecture consists of a high-speed parallel microprocessor which contains two
specialized units. These are a data arithmetic unit (DAU) for floating point operations and a control
arithmetic unit (CAU) for integer control. The floating point processor performs parallel
multiplication and result accumulation, while the integer processor performs the address pointer
update for the next operation. All of these operations are done concurrently during the instruction
cycle.


DSP  circuits  achieve  their  speed  advantage  by  their  parallel  and  pipelined  architecture  and 
the  use  of  optimized  algorithms.  The  DSP  architecture  is  designed  for  single  instruction  cycle 
multiply-accumulate  (MAC)  instruction  processing  and  index  updating.  A  typical  DSP  instruction 
written  in  pseudo-code  is  given  by: 

A[i] = a0 = a1 + B[j] * C[k]        Eq.(1.1)

This could alternatively be written using pointer notation,

*A++ = a0 = a1 + *B++ * *C++        Eq.(1.2)

where A[i], B[j], and C[k] represent data elements and a0 and a1 represent floating point
accumulators. Thus, a single instruction may contain as many as five address references: two
referencing floating point accumulators and three referencing physical memory locations. If the use
of these instructions is maximized in statistical algorithms, a significant speedup can be achieved.

¹ In the following, we refer to SW as the hardware and software needed for a DSP-based statistical analysis
workstation running in a microcomputer-based environment.
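To make the instruction form concrete, the following is a minimal sketch in portable C of loops whose bodies have the shape of Eq.(1.2). The function names are hypothetical, and a host compiler will not necessarily map each loop step to a single machine instruction as the DSP does; the sketch only shows how a statistical kernel can be arranged so that every step is one multiply, one accumulate, and a set of pointer increments.

```c
#include <assert.h>
#include <stddef.h>

/* Store form of Eq.(1.2): each step writes a1 + B[i] * C[i] to A[i] while
   advancing all three pointers. */
void mac_store(float *a, const float *b, const float *c, float a1, size_t n)
{
    float a0;
    assert(n == 0 || (a != NULL && b != NULL && c != NULL));
    while (n-- > 0) {
        *a++ = a0 = a1 + *b++ * *c++;
    }
}

/* Accumulating form: the previous result is fed back as the addend, which
   yields a dot product, the core of most matrix and regression kernels. */
float mac_dot(const float *b, const float *c, size_t n)
{
    float a0 = 0.0f;  /* floating point accumulator */
    assert(n == 0 || (b != NULL && c != NULL));
    while (n-- > 0) {
        a0 = a0 + *b++ * *c++;
    }
    return a0;
}
```

For B = {1, 2, 3} and C = {4, 5, 6}, mac_dot returns 32, and mac_store with a1 = 1 writes {5, 11, 19}.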

The specialized architecture enables the DSP to achieve up to two orders of magnitude of
improvement over the conventional numeric coprocessor. However, unlike the numeric coprocessor,
the DSP is not a simple plug-in device that interfaces directly with the main processor and higher
order languages. The use of a DSP in a PC requires a commercially available add-on board and the
development of additional software for interfacing the DSP to the main processor and statistical
applications programs. Transferring this technology to personal computers and to statistical analysis
and prediction problems holds great potential and is one of the main objectives of this study.

1.3  Phase  I  Research  Objectives 

The primary objective of the Phase I effort was to evaluate the conceptual feasibility of the
DSP-based statistics workstation. This, in turn, led to the identification of a number of specific
objectives and their corresponding research tasks, as described below.

1.3.1  Algorithm Selection, Evaluation, and Optimization

In  this  effort,  the  DSP  software  development  emphasis  is  on  the  algorithms  that  require 
repetitive  calculations  or  the  so-called  "computation-intensive  statistics",  such  as  bootstrapping,  etc. 
[Diaconis  1983].  Noreen  [1989]  has  predicted  that  a  major  trend  in  statistical  programming  packages 
will  be  including  these  computation-intensive  algorithms.  Further  development  emphasis  will  be  on 
computations  requiring  iteration  and  where  global  optimization  is  not  possible,  such  as  projection 
pursuit  regression. 
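As an illustration of why such methods are computation-intensive, the sketch below estimates the variance of a sample mean by bootstrapping: the data are resampled with replacement many times and the statistic is recomputed for each resample. It is a minimal example under stated assumptions, not the code developed in this study. The function names are hypothetical, and the small linear congruential generator is included only to keep the example self-contained and reproducible; a production routine would use a well-tested generator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Small linear congruential generator, used here only for reproducibility. */
static uint32_t lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state;
}

/* Bootstrap estimate of the variance of the sample mean of x[0..n-1],
   based on n_boot resamples drawn with replacement. The square root of
   the result is the bootstrap standard error. */
double bootstrap_var_mean(const double *x, size_t n, size_t n_boot,
                          uint32_t seed)
{
    double sum = 0.0, sum_sq = 0.0;
    uint32_t state = seed;
    assert(x != NULL && n > 0 && n_boot > 1);
    for (size_t r = 0; r < n_boot; ++r) {
        double total = 0.0;
        for (size_t i = 0; i < n; ++i) {
            /* use the high-order bits, which are the better-behaved ones */
            size_t idx = (size_t)((lcg_next(&state) >> 16) % n);
            total += x[idx];
        }
        double m = total / (double)n;   /* mean of this resample */
        sum += m;
        sum_sq += m * m;
    }
    double grand = sum / (double)n_boot;
    double var = (sum_sq - (double)n_boot * grand * grand)
                 / (double)(n_boot - 1);
    return var > 0.0 ? var : 0.0;       /* guard against rounding below zero */
}
```

The inner loop executes n x n_boot times, so even a small data set leads to a large number of floating point additions; this repetitive structure is what makes the method a candidate for the DSP.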

This study focuses on adapting and fine-tuning the basic statistical algorithms for use in DSP
applications, not on developing new high-level algorithms. The statistical computation algorithms will
then be applied to a high-speed, single-user workstation concept. As noted with supercomputer
applications [Harrod 1987], optimization of the low-level algorithms (such as the basic linear algebra
subprograms (BLAS) of LINPACK [Dongarra 1979]) has resulted in greatly improved performance
of many of the high-level routines [Bates 1987]. The effectiveness of this approach with statistical
problems and the DSP was confirmed here as well².

² Optimization of the computation-intensive algorithms should not be taken lightly. Lucky [1989] noted that
algorithm development accounted for most of the computing speed improvement in the past several decades. He
estimated that 4 orders of magnitude of improvement have come from device speed and that 7 orders have come
from improved algorithms. The algorithms contributing most have used special symmetries in the data, such as
the fast Fourier transform, Toeplitz matrices, etc.

Emphasis is also placed on selecting the most efficient algorithms for the processor. For
example, many of the algorithms optimized for the fast Fourier transform (FFT) on conventional
processors have concentrated on reducing multiplications at the expense of additions [Blahut 1985].
However, with the DSP chips now in use, multiplications have virtually the same overhead as
additions, so algorithms were optimized with this in mind.
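A familiar instance of this tradeoff is complex multiplication, the basic step of an FFT butterfly. The sketch below is an illustration chosen for this discussion rather than code from the study: it contrasts the direct form (4 multiplications, 2 additions) with the rearranged form (3 multiplications, 5 additions). The cost helper and its weights are hypothetical; they only show that the rearranged form pays off when a multiplication is much more expensive than an addition and loses when the two cost the same.

```c
#include <assert.h>

typedef struct { double re, im; } cplx;

/* Direct form: 4 multiplications and 2 additions (6 operations). */
cplx cmul_direct(cplx x, cplx y)
{
    cplx z;
    z.re = x.re * y.re - x.im * y.im;
    z.im = x.re * y.im + x.im * y.re;
    return z;
}

/* Rearranged form: 3 multiplications and 5 additions (8 operations). */
cplx cmul_three_mult(cplx x, cplx y)
{
    double k1 = y.re * (x.re + x.im);
    double k2 = x.re * (y.im - y.re);
    double k3 = x.im * (y.re + y.im);
    cplx z;
    z.re = k1 - k3;
    z.im = k1 + k2;
    return z;
}

/* Total cost of an operation mix for given (hypothetical) per-operation
   costs, e.g. in instruction cycles. */
int op_cost(int mults, int adds, int mult_cost, int add_cost)
{
    assert(mults >= 0 && adds >= 0 && mult_cost > 0 && add_cost > 0);
    return mults * mult_cost + adds * add_cost;
}
```

With equal costs, the direct form needs 6 units against 8 for the rearranged form; if a multiplication is assumed to cost ten additions, the figures become 42 against 35 and the preference reverses.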

1.3.2  Interface Definition

The statistics workstation will be designed as an extension to the conventional personal
computer, using the DSP as a "number cruncher". The main processor in the PC will be responsible
for handling program and data management, data display, and support of the user interface.

Since this interface has a major effect on the operational efficiency of the workstation, a
working prototype suitable for a feasibility investigation was developed and used for interface design,
algorithm optimization, and expected performance evaluation.

1.3.3  Performance and Cost Evaluation

Since  a  DSP-based  computer  for  statistical  analysis  is  an  unconventional  concept,  its 
acceptance  will  depend  on  achieving  considerable  improvement  in  speed,  while  keeping  the  cost  low. 
Our  objective  will  be  to  provide  solutions  to  a  relatively  wide  range  of  computation-intensive 
statistical  problems  which  currently  require  the  use  of  supercomputers. 

In  this  report,  a  detailed  tradeoff  analysis  (Section  9)  is  made  to  select  the  best  DSP-based 
statistics  workstation  configuration.  This  selection  is  based  on  measured  benchmarks  taken  during 
this  study.  The  analysis  also  presents  recommendations  for  software  development,  memory  sizing,  etc. 
This  will  enable  a  selection  of  the  most  promising  algorithms  to  be  incorporated  in  the  statistics 
workstation. 

1.4  Concept  Feasibility  Questions  and  Answers 

The concept feasibility questions posed for the Phase I effort and the corresponding
conclusions are:

1.  Can the basic algorithms needed for the statistical analysis and forecasting be modified to
provide a substantial improvement in processing using the DSP hardware? - Our investigation revealed
that the majority of the statistical algorithms could be modified to take advantage of the unique
features of the DSP and thus achieve a substantial improvement in speed.

2.  Are there any bottlenecks that could reduce the expected performance of the proposed
approach? - No major bottlenecks were found. However, if the statistical algorithms contained a
large percentage of operations that required integer or conditional operations, then the performance
improvement was much lower.

3.  What is the effect of the interface? How can it be improved? - The proposed DSP
interface was easily implemented through a formal definition procedure. For computation-intensive
statistical computations, the effect of the interface was minimal. A faster data transfer speed through
a 32-bit interface will be available in the next-generation devices.
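The observation in the second answer can be made quantitative with Amdahl's law, which is introduced here as an explanatory aid and is not a model used in the study. If a fraction p of the run time consists of floating point array work that the DSP accelerates by a factor s, and the remainder (integer and conditional operations) is not accelerated, the overall speedup is 1 / ((1 - p) + p / s). The function name and the sample figures below are assumptions for illustration only.

```c
#include <assert.h>

/* Overall speedup when a fraction accel_fraction of the original run time
   is sped up by accel_factor and the rest runs at its original speed
   (Amdahl's law). */
double overall_speedup(double accel_fraction, double accel_factor)
{
    assert(accel_fraction >= 0.0 && accel_fraction <= 1.0);
    assert(accel_factor > 0.0);
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / accel_factor);
}
```

For example, with s = 100 an algorithm that is entirely array arithmetic gains the full factor of 100, whereas one in which only half of the time is array arithmetic gains a factor of slightly less than 2.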
1.5  Report Outline


The  remainder  of  this  report  addresses  the  specific  tasks  necessary  to  evaluate  the  feasibility 
of  a  DSP-based  statistics  workstation. 

Section  2  presents  an  overview  of  statistical  user  needs  and  existing  statistical  software 
packages.  This  section  also  discusses  new  trends  in  statistical  techniques. 

Section 3 deals with the selection of computationally-intensive statistical analysis techniques.
In this section, we give a brief summary of the statistical algorithms and their feasibility for use on
DSP’s. The algorithms which are readily available for PC’s and perform adequately in their
commercial software form are disregarded.

The identification of low-level building blocks is presented in Section 4. The statistical
algorithms are broken down into low-level components for effective use in the DSP. These are then
tabulated into a library of subroutines similar to the supercomputer BLAS routines.
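The building-block idea can be sketched as follows. The routines below are illustrative stand-ins written in portable C, not the library tabulated in Section 4; the sw_ names and the calling conventions are assumptions. Two low-level blocks (a dot product and a scaled vector addition) are sufficient to express the sample mean and the unbiased sample variance, so that only the two blocks would need to be hand-optimized for the DSP.

```c
#include <assert.h>
#include <stddef.h>

/* Building block: dot product, the sum of x[i] * y[i]. */
double sw_dot(const double *x, const double *y, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; ++i)
        acc += x[i] * y[i];
    return acc;
}

/* Building block: y[i] = y[i] + alpha * x[i]. */
void sw_axpy(double alpha, const double *x, double *y, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}

/* Sample mean expressed as a dot product with a vector of ones. */
double sw_mean(const double *x, const double *ones, size_t n)
{
    assert(n > 0);
    return sw_dot(x, ones, n) / (double)n;
}

/* Unbiased sample variance; work is caller-supplied scratch of length n. */
double sw_variance(const double *x, const double *ones, double *work,
                   size_t n)
{
    assert(n > 1);
    double m = sw_mean(x, ones, n);
    for (size_t i = 0; i < n; ++i)
        work[i] = x[i];               /* copy so that x is left untouched */
    sw_axpy(-m, ones, work, n);       /* center the data: work = x - mean */
    return sw_dot(work, work, n) / (double)(n - 1);
}
```

For the data {2, 4, 4, 4, 5, 5, 7, 9}, sw_mean returns 5 and sw_variance returns 32/7.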

Section  5  deals  with  the  proposed  statistics  workstation  design.  It  presents  the  design 
philosophy,  defines  objectives,  and  discusses  key  components  of  the  workstation.  The  choice  of  the 
most  suitable  DSP  chip  is  based  on  the  low-level  building  block  requirements.  Due  to  their  favorable 
architecture,  multi-  or  parallel  processing  using  DSP  chips  within  the  statistics  workstation 
environment  is  also  considered. 

Section  6  discusses  the  DSP-based  software  development  and  explains  how  the  generic  DSP 
subroutines  are  coded  and  inserted  into  high-level  statistical  algorithms.  The  use  of  a  high-level, 
portable  language  for  development  is  advised. 

The  definition  of  interface  requirements  between  the  PC  and  DSP  is  presented  in  Section  7. 
To  get  the  optimal  performance  from  the  host  processor  and  DSP,  several  interfacing  schemes  are 
reviewed.  Automated  generation  of  the  software  interface  between  the  PC  host  and  the  DSP  using 
a  software  description  language  (SDL)  is  found  to  be  useful. 

The  statistics  workstation  performance  evaluation  is  in  Section  8.  The  performance  of  the 
DSP-based  workstation  is  compared  against  the  stand-alone  PC  version  through  dual  software 
development.  The  dual  development  is  simplified  through  the  use  of  a  portable  language  such  as  C. 

The  overall  statistics  workstation  performance/cost  evaluation  results  are  presented  in  Section 
9.  Cost  estimates  are  then  made  for  a  range  of  workstation  configurations. 

The final section, Section 10, presents our conclusions and recommendations. The
recommendations for the statistics workstation design are based on the evaluation test results. The
workstation architecture and software support recommendations are briefly described below.

Hardware. The architecture recommendations include selection of DSP type, speed, memory,
hardware interface, and other implementation aspects. One approach to the statistics workstation
would be to provide a complete turnkey system, since few users will be familiar with the DSP
hardware and the internal computer system configuration. Another approach would be to provide
an add-on kit for upgrading an existing PC, although port-compatibility problems could increase the
risk of this approach.

Software. Software support for the statistics workstation is identified on several levels, starting
at the BLAS level and proceeding to the applications level. For each level, software modules must
be coded and optimized. In the feasibility study, several program modules containing
computation-intensive statistical algorithms were coded and tested on a commercially available board.
As the largest speed improvements were found on the algorithms that used the low-level subroutines
effectively, we recommend these for future development.
2  STATISTICAL COMPUTING NEEDS


Since  the  advent  of  computers,  scientific  researchers  have  desired  interactive,  high-speed,  and 
cost  effective  machines  capable  of  solving  complex  statistical  and  forecasting  problems.  Recent  trends 
in  statistical  research  have  made  this  desire  a  high  priority. 

In  the  past,  supercomputers  have  often  been  used  to  solve  the  more  complicated  problems. 
However,  this  approach  is  not  always  possible  for  many  researchers  due  to  acquisition  cost  (starting 
at  $3,000,000),  operational  cost  (supercomputer  time  is  $2000/hr  and  up),  and  scarcity  (about  300 
supercomputer  systems  available  in  the  United  States)  [Goldberg  1988]. 

Given that the five NSF-sponsored supercomputer centers can provide only limited service
to the academic community, most researchers have to use the more easily available
mainframes, minicomputers, and workstations. However, even the larger minicomputers and
workstations are not always widely available. A 1986 survey of statistics departments at 30 major
Ph.D. granting universities showed that 70% of the statistics departments did not have workstations
and that 53% of the departments lacked graphics terminals [Eddy 1986a]. The same survey also
showed that hardware acquisition had the highest priority but was limited by the available funding.

As  nonparametric  and  computation-intensive  statistical  techniques  gain  in  popularity,  the 
demand  for  an  efficient  statistics  workstation  to  support  these  computations  will  increase.  In  this 
effort  we  have  attempted  to  develop  a  low-cost  solution  to  these  needs  by  proposing  the  development 
of  a  DSP-based  statistical  analysis  workstation.  Since  this  workstation  will  be  used  in  the  existing 
statistical  computing  environment,  a  short  review  of  the  current  statistical  user  needs  and  existing 
statistical  computing  packages  and  supporting  tools  follows. 

2.1  Statistical  User  Needs 

We expect that several different groups of users will be interested in conducting
computation-intensive statistical analysis and will need a statistics workstation to support their
efforts. The majority of the initial workstation users will be from those academic, industrial, and
government communities which already have been exposed to computation-intensive techniques.

Although the needs of the user communities differ, they will have a common interest in a
low-cost solution because of the limited budget that is normally available for statistical
investigations. The only applications that we cannot address immediately with the statistics
workstation concept are those that are memory intensive, requiring more storage than is normally
available in a low-cost, desk-top environment.

University environment. The university environment is more research oriented and requires
a wider range of capabilities than the industrial counterpart. The key university users will include
statisticians and scientific researchers in applied, medical, and social science areas. As mentioned in
Section 1, there is a well-established need for advanced statistical computing capability [Eddy 1986a].

Industrial environment. Many of the industrial applications will be manufacturing oriented and
will involve quality control or process optimization computations. With the current emphasis on
quality, major improvements in statistical process control (SPC), statistical quality control (SQC), and
simulation methods will be desired. Furthermore, many of the SQC applications are
computation-intensive because of the large number of variables involved which determine the product
quality [Love 1988, Lewi 1982]. Most of the process optimization in the past has been performed in an
offline batch mode, often on a daily or weekly basis. However, performing the operations in an offline
mode involves delay. This can lead to resource waste if the process operates less than optimally
between adjustments. Thus, the use of an online approach may contribute to product quality
improvement (see Figure 2.1). Here, the emphasis will be on fast computation capabilities and on
processing significant data, typically obscured by noise, in a real-time, on-line mode
[Electronics 1990]. To be cost-effective, high-speed specialized computers will be needed for this
task.


Figure  2.1  Online  statistical  process  control. 
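As a simple picture of what online processing involves, the sketch below keeps running estimates of a process mean and variance and flags each new measurement that falls outside three standard deviations of the measurements seen before it. It is an illustration under stated assumptions, not a method taken from the cited references: the names are hypothetical, the running update is Welford's algorithm, and a practical control chart would normally fix its limits from an in-control reference period rather than fold every sample, flagged or not, into the statistics.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    long count;     /* number of samples seen so far */
    double mean;    /* running mean */
    double m2;      /* running sum of squared deviations from the mean */
} spc_state;

void spc_init(spc_state *s)
{
    assert(s != NULL);
    s->count = 0;
    s->mean = 0.0;
    s->m2 = 0.0;
}

/* Returns 1 if sample lies outside mean +/- 3 sigma of the samples seen
   before it, otherwise 0, and then folds the sample into the running
   statistics. No flag is raised until two samples have been seen. */
int spc_update(spc_state *s, double sample)
{
    int out_of_control = 0;
    assert(s != NULL);
    if (s->count >= 2) {
        double variance = s->m2 / (double)(s->count - 1);
        double dev = sample - s->mean;
        /* compare squared quantities so that no square root is needed */
        out_of_control = (dev * dev > 9.0 * variance);
    }
    s->count += 1;
    double delta = sample - s->mean;
    s->mean += delta / (double)s->count;
    s->m2 += delta * (sample - s->mean);
    return out_of_control;
}
```

Each measurement costs a fixed, small number of floating point operations, so the check can keep pace with the process; the computational burden grows when many variables are monitored jointly.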


With  these  users  in  mind,  as  well  as  other  users  in  the  fields  of  medicine,  business,  etc.,  the 
necessary  statistical  software  methods  and  tools  for  a  desk-top  environment  can  be  determined. 

2.2  Statistical  Software  Program  Development 


Many statistical programs have been developed for specific applications, often requiring a staff
of programmers or at least one person expending a great deal of effort. Because these programs
have been written to perform a specific job, they can be optimized for speed. However, some of the
drawbacks of creating user-specific applications include:

o  Each  special  purpose  program  requires  a  new  development  effort. 

o  Special  purpose  programs  can  be  flexible,  but  only  for  those  options  included.  Any 
maintenance  effort  or  modifications  on  the  programs  will  require  additional  expense. 

These are both prime considerations when starting any software project. However, to shorten
the development time and reduce software development cost, existing statistical libraries should be
used whenever possible. This is where an established library such as IMSL or NAG [McCullagh
1983] can be of use. However, before using libraries indiscriminately, it is important to verify that
the individual modules are compatible and well understood. Modules should be fully debugged and
test data made available. Defective random number generators are an example of poorly designed
routines that have been included in some libraries [Lewis 1989].
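The kind of verification meant here can be illustrated with a well-known case, which is not necessarily the one discussed in the cited reference. The RANDU generator, x' = 65539 x mod 2^31, was widely distributed in scientific libraries, yet every three consecutive outputs satisfy the exact relation x[k+2] = 6 x[k+1] - 9 x[k] (mod 2^31). The sketch below, with hypothetical function names, counts how many triples break that relation; for RANDU the count is always zero, which exposes the strong correlation between successive values.

```c
#include <assert.h>
#include <stdint.h>

#define RANDU_MODULUS 2147483648u   /* 2^31 */

/* One RANDU step: x' = 65539 * x mod 2^31 (the seed should be odd). */
uint32_t randu_next(uint32_t x)
{
    return (uint32_t)(((uint64_t)65539u * x) % RANDU_MODULUS);
}

/* Count the consecutive triples, out of n, that violate
   x[k+2] = 6 * x[k+1] - 9 * x[k] (mod 2^31). */
int randu_triple_violations(uint32_t seed, int n)
{
    uint32_t a = seed;
    uint32_t b = randu_next(a);
    int violations = 0;
    assert(n >= 0);
    for (int k = 0; k < n; ++k) {
        uint32_t c = randu_next(b);
        /* add 9 * 2^31 before subtracting so the value stays non-negative */
        uint64_t predicted =
            (6ull * b + 9ull * RANDU_MODULUS - 9ull * a) % RANDU_MODULUS;
        if (c != predicted)
            ++violations;
        a = b;
        b = c;
    }
    return violations;
}
```

A sound generator would break the relation for nearly every triple. A check of this sort, run against supplied test data, is an inexpensive way to screen a library module before it is built into a larger algorithm such as the bootstrap.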
As an example of a widely used and heavily debugged library, IMSL offers a wide range of
mathematical and statistical functions (over 500 FORTRAN subroutines). The basic features of the
IMSL STAT/PC-LIBRARY are listed in statistics groupings 1 through 17 in the next section.

The IMSL subroutines are also common to many of the canned statistical packages. In this
regard, the ease of use and convenience of the latter programs make them preferred over
user-specific programs whenever they are available.

2.3  Conventional  Statistical  Analysis  Packages 

Several hundred statistical computer program packages are available for the PC environment.
The majority of these programs are similar in their fundamental capabilities (and algorithms) and
cover almost every statistical evaluation need. However, many of these programs are not suitable for
the computation-intensive tasks because of their speed and the available hardware, and few have the
algorithms required. Thus, it will be important to offer in a statistics workstation those
features which are not now available in the PC environment but are needed for more complex
analyses.

Since  detailed  reviews  of  statistical  programs  are  already  available  [Woodward  1988,  Fridlund 
1990],  extensive  reviews  of  the  available  programs  were  not  attempted.  The  available  reviews, 
however,  helped  in  determining  the  features  to  be  used  in  the  proposed  statistics  workstation.  Most 
of  the  significant  packages  contain  at  least  the  following  computational  capabilities  (not  including  the 
file  handling,  graphics,  and  other  features).  Of  those  features  marked  with  an  asterisk  (*),  we  have 
done  limited  prototyping  for  feasibility  studies.  Some  of  these  computations  will  often  be  used  in  a 
larger  context  (such  as  bootstrapping). 

*   1.  Basic or descriptive statistics: mean, variance, etc.
*   2.  Regression
*   3.  Correlation
    4.  Analysis of variance (ANOVA)
    5.  Categorical and discrete data analysis
*   6.  Nonparametric statistics
    7.  Tests of goodness of fit, significance, and randomness
*   8.  Time series analysis and forecasting
*   9.  Covariance structures and factor analysis
*  10.  Discriminant analysis
*  11.  Cluster analysis
*  12.  Survival analysis, life testing, and reliability
   13.  Multidimensional scaling
*  14.  Density and hazard estimation
   15.  Probability distribution functions and inverses
*  16.  Random number generator
   17.  Mathematical operations
        a.  Linear systems
        b.  Eigensystem analysis
        c.  Interpolation, approximation
     *  d.  Integration, differentiation
     *  e.  Differential equations
        f.  Transforms
        g.  Nonlinear equations
     *  h.  Optimization
     *  i.  Basic matrix, vector operations
*  18.  Exploratory data analysis.

The majority of the large statistical software suppliers offered a wide range of algorithms for
each heading. Lacking in their products, however, for a variety of reasons, were the algorithms that
stressed computation-intensive methods and Bayesian methods.

2.3.1  Types of Statistical Packages

There are two major classes of statistical packages: procedure-based and application
language-based. The procedure-based packages provide a wide selection of different statistical
routines, which may be selected from a menu. The application language-based packages, on the other
hand, provide a highly flexible language which permits the user to specify more complex computation
procedures. This classification is not always clear cut and a mix of both features is available in many
of the statistical program packages.

Currently,  most  of  the  procedure-based  statistical  program  packages  are  coded  in  the 
FORTRAN  language,  while  the  more  recent  language-based  systems  use  the  C  language. 

Procedure-based Packages. Typical examples of commercially available procedure-based
statistical programs include SAS [Jaffe 1989], BMDP [BMDP 1985], Minitab [Ryan 1985], SPSS,
Statgraphics, and Systat [Fridlund 1990].

There  are  several  advantages  of  using  procedure-based  packages. 

o  Many have been proven reliable from their origin as mainframe packages to their present
form on PC’s.

o  Based on their longevity, many also have a large installed user base.

o  Procedure-based packages feature fast, compiled modules.

o  The PC version packages typically feature model setup and are often menu-oriented.

o  Most are easy to learn.

There  are  also  disadvantages  of  procedure-based  packages. 

o  They  are  not  as  flexible  as  a  language-based  package  because  computations  are  limited  to  the 
routines  available. 

o  Procedure-based  packages  are  seldom  highly  interactive  (this  is  traceable  to  mainframe 
origins). 

o  They  often  have  limited  graphics  capabilities  (also  traceable  to  mainframe  origins). 

The latter two disadvantages are sure to diminish with time as programs become more
interactive in nature. Flexibility has improved with the addition of user-written BASIC syntax
routines for Systat and APL for Statgraphics [Fridlund 1990].
An important point to consider is that speed can be substantially degraded if data is kept on
disk (Systat and SPSS) rather than memory (Minitab and Statgraphics). The principal reason for
using disk access is for handling large amounts of data. If the calculations are not time-demanding,
disk access is preferred to save memory space. However, if the data sets are small with many
computations involved (typical of bootstrapping, etc.), disk access will slow computations down by
orders of magnitude.

Language-based  Packages.  The  language-based  statistical  program  packages  use  an 
application-oriented  higher  level  language  to  control  data  and  processing  operation  selection.  A 
typical  example  of  this  approach  is  the  S  language. 

Statistical programs that use a higher-level language are particularly well suited for
interactive applications, because they permit easy generation of macros for repetitive
operations (see Figure 2.2). Language-based packages have several further advantages over
conventional procedure-based programs:

o  They  have  great  flexibility  in  that  the 
output  from  one  module  can  be  used  as 
an  input  to  another  module. 

o  They are user extendable. The S language, in particular, has the capability to add
user-developed modules. Since these modules can be controlled by the language control
statements, overhead associated with the custom development of programs is reduced.

o  The user has control over many more of the details, methods, and assumptions used in an
algorithm. Several reviewers have noted that it is unwise to place too much trust in a
procedure-based package because of the poor methods and assumptions often used [Dallal 1988,
Searle 1989].

There are also disadvantages to using the language-based programs. Most of the statistical
languages are relatively complex because of the large number of commands and options available,
and many have a poor user interface. However, once the user becomes familiar with the language,
much greater efficiency can be attained.

The  distinction  between  the  language  and  procedure-based  programs  is  not  as  apparent  as 
it  once  was,  primarily  because  the  makers  of  procedure-based  packages  have  included  extensions  for 
user-modifiable  programs. 


Figure  2.2  Progression  in  building  a  language 
based  application.  Alongside  is  shown  the 
importance  of  macro-driven  and  compiled 
programs. 

2.3.2  Representative Examples

Analysis of the existing statistical programs can provide guidance for the statistics workstation
development. Appendix G provides an overview of those features which are currently included in
standard packages and which should be included in future packages. To ease the learning of the
statistics workstation environment, those familiar concepts which are widely used in statistical
analysis should also be incorporated.

2.4  Additional  Tools 

Besides  the  programming  languages  and  statistics  packages  required  by  the  users,  additional 
tools  for  data  base  management  (such  as  spreadsheets)  and  more  complex  mathematics  are  often 
needed.  These  are  also  listed  in  Appendix  G. 


2.5  Incorporation  of  Statistical  Tools  into  the  Workstation  Environment 

The concepts discussed in this section were used in planning the proposed low-cost, high-speed
statistics workstation. In addition to the computational methods involved, complementary tools
such as the artificial intelligence/expert system used in Statistical Navigator™ may be valuable in
guiding the user through the available statistical algorithms [Brent 1989].

Based  on  the  current  user  needs,  the  long  term  objective  of  the  project  will  be  to  develop  a 
statistics  workstation  capable  of  providing  the  following: 

o  Fast  and  interactive  environment  for  lengthy  problems  and  the  management  of  large  data 
files. 

o  Excellent  color  graphics  display  and  windowing  capabilities  to  display  the  data,  the  analysis 
results,  and  to  stimulate  intuition. 

o  User-friendly environment to encourage the widest use of statistical analysis and forecasting
techniques by researchers from many different disciplines.

The  algorithms  and  concepts  that  are  promising  for  DSP  use  will  be  discussed  in  the  next 
section.  The  emphasis  will  be  placed  on  those  applications  that  require  large  computational  effort. 

3  POTENTIAL STATISTICAL APPLICATIONS

The statistical applications that require high computation rates are numerous. Even
applications that seem computationally simple at first can slow down if more parameters
are added (factorial growth) or if larger data sets are created. The purpose of this section is to
isolate the computationally intensive applications. Wherever possible, we have identified similar
applications in signal processing where DSP's have been used in the past.

3.1  Analysis  of  Statistical  Algorithms  and  Techniques 

We  first  present  a  brief  discussion  of  the  major  groups  of  algorithms  used  in  statistical 
analyses.  In  this  section,  emphasis  is  placed  on  computationally  intensive  methods  and  their  special 
computing  requirements. 

3.1.1  Basic  and  Descriptive  Statistics 

Basic statistics such as mean and variance normally do not require much computational effort.
However, in those instances where population resampling is attempted (such as bootstrapping for
standard error estimation or a Monte Carlo analysis), the load will increase in proportion to the
number of simulations attempted. For this reason, it is important that these algorithms be
optimized for speed.

3.1.2  Regression 

Regression  methods  typically  require  matrix  manipulations.  For  least  squares  regression  of 
linear  data  sets,  the  calculation  of  pseudo-inverses  is  necessary.  There  are  many  algorithms  for 
dealing  with  this  task  including  singular  value  decomposition,  SWEEP  operator,  etc.  [Maindonald 
1984,  Kennedy  1980].  These  algorithms  do  not  typically  require  inordinate  amounts  of  computer  time 
by  today’s  standards.  However,  when  larger  data  sets  and  resampling  techniques  [Robinson  1987]  are 
used,  the  computation  time  may  become  prohibitive. 

Regression analysis involves heavy use of sums of squares. Nonlinear transformations in
regression, such as exponential or logarithmic, involve evaluation of a power series. For nonlinear
regression problems, iteration may be required to find the global minimum. Therefore, for
multidimensional data sets or those with many local minima, the computation time can become
lengthy. Efficient software for multiple regression requires optimization of the computing sequence
and the indexing of variables.
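
The dependence of regression on sums and sums of squares can be seen in a minimal sketch of one-predictor least squares. The function name `fit_line` and the sample data are illustrative assumptions and are not taken from the workstation software; the point is only that a single pass of multiply-accumulate operations yields every quantity the fit needs.

```python
def fit_line(x, y):
    """Least-squares slope and intercept from running sums (one pass)."""
    n = len(x)
    sx = sy = sxx = sxy = 0.0
    for xi, yi in zip(x, y):
        sx += xi
        sy += yi
        sxx += xi * xi      # sum of squares
        sxy += xi * yi      # sum of cross products
    denom = n * sxx - sx * sx
    if denom == 0.0:
        raise ValueError("x values are all identical")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Points lying exactly on y = 2x + 1.
slope, intercept = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

In a bootstrap loop this function would be called once per resample, which is why the cost of the inner sums dominates.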

Projection  pursuit  regression  [Friedman  1974,  Jones  1987]  is  a  nonlinear  exploratory  data 
analysis  technique  that  typically  may  include  many  Gauss-Seidel  iterations  to  arrive  at  an  optimum 
condition [Thisted 1988]. Bootstrapping on top of projection pursuit makes it even more
computationally  intensive  [Efron  19^]. 

3.1.3  Correlation


Calculation  of  correlation  and  covariance  becomes  computationally  intensive  if  techniques 
such  as  bootstrapping  are  applied.  Diaconis  [1983]  demonstrates  the  application  of  bootstrapping  to 
computing  standard  errors  on  a  correlation  coefficient  and  discusses  the  increase  in  computation  time. 

The  computation  of  bivariate  correlation  is  similar  to  the  computation  of  convolution  in 
electronic  signal  analysis.  Efficient  DSP  algorithms  for  computing  convolution  exist  and  can  be 
modified to handle bivariate correlation.
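
As a rough illustration of why correlation maps well onto multiply-accumulate hardware, the sketch below computes a bivariate correlation coefficient from three inner products. The helper names `dot` and `correlation` are assumptions made for this example only.

```python
import math


def dot(x, y):
    """Inner product: one multiply-accumulate per element pair."""
    acc = 0.0
    for xi, yi in zip(x, y):
        acc += xi * yi
    return acc


def correlation(x, y):
    """Pearson correlation from three inner products of centered data."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    dx = [xi - mx for xi in x]
    dy = [yi - my for yi in y]
    return dot(dx, dy) / math.sqrt(dot(dx, dx) * dot(dy, dy))


r_up = correlation([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
r_down = correlation([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```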

3.1.4  Analysis  of  Variance 

Analysis of variance (ANOVA) requires sum and sum-of-squares evaluations. Ordinarily, this
method presents little computational load to the PC. However, as the number of factors increases,
the computational load will increase. Applying randomization tests to ANOVA will also increase the
computational load [Noreen 1989]. In these cases, DSP's are ideal for the sum and sum-of-squares
evaluations within the ANOVA algorithm.

3.1.5  Categorical  and  Discrete  Data  Analysis 

Categorical and discrete data analysis [Santner 1989] is often adequately managed by existing
computers. In general, ordinal and nominal (qualitative) data are better suited for integer processors,
whereas ratio and interval (quantitative) data are suited for floating point processors such as a DSP.
As an example of the latter case, combinatorial problems in discrete data analysis may require discrete
Fourier transforms for computing distributions [Thisted 1988].

3.1.6  Nonparametric  Statistics 

If  the  sampled  population  is  not  normal  or  if  there  is  concern  about  "outlier"  observations, 
then  nonparametric  techniques  must  be  used.  The  conventional  nonparametric  procedures  include 
the well-known sign tests and rank procedures. However, many of these tests were
introduced before the advent of computers and were designed to simplify arduous hand calculations.
More recently, a number of new nonparametric statistical techniques, such as shuffling, have been
introduced which require substantial computer support and are often referred to as nonparametric
computation-intensive statistical methods. These techniques will be more suitable for the statistics
workstation application.

3.1.7  Tests  of  Goodness  of  Fit,  Significance,  and  Randomness 

Even  though  these  tests  may  look  formidable  in  their  use  of  integrals  and  series 
approximations,  they  are  not  considered  computationally  intensive.  For  example,  when  running  a 
simulation  experiment,  the  significance  testing  will  only  be  done  once  at  the  end  of  the  trial.  The 
computer  time  involved  in  calculating  the  test  will  be  negligible  compared  to  that  involved  in  the 
simulation. 

3.1.8  Time Series and Forecasting

Time  series  analysis  includes  autocorrelation,  moving  averages,  cross  correlation,  and  spectral 
analysis. Smoothing techniques used in statistical forecasting are similar to techniques used for
electronic signal filtering in signal processing applications. These operations are typically rich in
floating  point  array  calculations. 

Since  DSP’s  were  originally  developed  for  signal  processing  applications,  highly  efficient 
algorithms  already  are  available  as  part  of  the  standard  DSP  libraries  provided  by  the  manufacturers 
of  these  devices.  A  number  of  statistical  techniques  have  direct  counterparts  in  signal  processing. 
Once  this  relationship  is  recognized,  DSP  algorithms  can  be  used  either  directly  or  with  minor 
modifications.  Two  such  mappings  are  illustrated  below. 

o  AR  (autoregression)  ->  IIR  (infinite  impulse  response  filter) 
o  MA (moving average) -> FIR (finite impulse response filter)
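
The two mappings can be sketched in a few lines. The functions `fir_filter` and `iir_filter` below are illustrative assumptions (they are not routines from a DSP vendor library): the FIR form weights past inputs, as a moving-average model does, and the all-pole IIR form feeds back past outputs, as an autoregressive model does.

```python
def fir_filter(b, x):
    """FIR filter: y[n] = sum_k b[k] * x[n-k] (moving-average form)."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, bk in enumerate(b):
            if n - k >= 0:
                acc += bk * x[n - k]
        y.append(acc)
    return y


def iir_filter(a, x):
    """All-pole IIR filter: y[n] = x[n] + sum_k a[k] * y[n-1-k] (autoregressive form)."""
    y = []
    for n in range(len(x)):
        acc = x[n]
        for k, ak in enumerate(a):
            if n - 1 - k >= 0:
                acc += ak * y[n - 1 - k]
        y.append(acc)
    return y


# Two-point moving average of a ramp, and the impulse response of an AR(1) model.
ma = fir_filter([0.5, 0.5], [2.0, 4.0, 6.0, 8.0])
ar = iir_filter([0.5], [1.0, 0.0, 0.0, 0.0])
```

In both cases the inner loop is a chain of multiply-accumulate steps, which is the operation the DSP executes most efficiently.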

Adaptive  filtering  techniques  are  used  to  determine  the  optimum  sets  of  weights  to  be  used 
in  forecasting  models.  The  specific  operations  involve  autoregression,  moving  average,  and 
autoregressive  moving  average  (ARMA).  Less  work  has  been  done  on  adapting  these  operations  to 
DSP  chips.  One  promising  area  that  is  computationally  intensive  is  bootstrapping  of  an  AR  model. 
This  technique  is  used  to  obtain  variability  of  coefficients  when  the  signal  is  obscured  by  iid  noise. 

An  adaptive  signal  processing  algorithm  often  used  for  forecasting  is  Kalman  filtering.  In  this 
method,  the  emphasis  is  on  prediction  as  more  observations  are  obtained  [Gelb  1974].  The  Kalman 
filter  algorithm  involves  computation  of  means  and  covariance  matrices.  Since  the  DSP  architecture 
supports  efficient  use  of  these  operations,  the  DSP  can  be  used  for  a  wide  range  of  different  Kalman 
filtering algorithms [Alexander 1986].
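
To make the predict-and-update cycle concrete, the following is a deliberately simplified scalar sketch. It assumes a random-walk state observed with additive noise, with known variances `q` and `r`; the function name and parameters are hypothetical, and a practical filter would carry mean vectors and covariance matrices instead of scalars.

```python
def kalman_1d(observations, q, r, x0=0.0, p0=1.0):
    """Scalar Kalman filter for a random-walk state observed with noise.

    q: process-noise variance, r: measurement-noise variance (both assumed known).
    Returns the list of filtered state estimates.
    """
    x, p = x0, p0
    estimates = []
    for z in observations:
        # Predict: the state mean carries over, its variance grows by q.
        p = p + q
        # Update: blend prediction and observation using the Kalman gain.
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates


# Fifty identical observations pull the estimate from 0 toward 5.
est = kalman_1d([5.0] * 50, q=0.0, r=1.0)
```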

3.1.9  Covariance  Structures  and  Factor  Analysis 

Factor  analysis  in  general  requires  numerical  linear  algebra.  Finding  the  principal  components 
of  a  multivariate  data  set  is  one  method  of  investigating  the  covariance  structure.  This  requires  sum 
of  squares  and  matrix  operations  which  can  become  computationally  intensive  when  placed  in  a  larger 
loop  such  as  is  required  for  bootstrapping. 

3.1.10  Discriminant  Analysis 

Fisher’s  linear  discriminant  is  one  example  of  discriminant  analysis.  A  probabilistic  neural  net 
which  has  foundations  in  discriminant  analysis  and  Bayesian  decision  making  has  recently  been 
proposed  [Specht  1990].  Neural  networks  often  rely  on  arithmetic  operations  between  all  the 
elements  in  an  array  (connectionist  model)  which  can  require  more  processing  power  than  is  available 
on  a  typical  PC. 

3.1.11  Cluster  Analysis 

Cluster  analysis  techniques  such  as  the  K-means  algorithm  [Hartigan  1985]  may  require 
computation  of  Euclidean  distances  to  distinguish  sets  of  data.  Bootstrapping  to  denote  measures 
of  uncertainty  in  the  classification  can  lead  to  very  long  computation  times  [Jain  1987].  Clustering 
techniques are often related to image processing applications such as pattern recognition,
classification,  and  scene  analysis  [Duda  19^].  Recently,  there  has  been  much  effort  in  applying 
DSP's to such applications [Fuccio 1988].

3.1.12  Survival  Analysis,  Life  Testing,  and  Reliability 

The  Kaplan-Meier  estimate  is  an  example  of  a  non-parametric  maximum  likelihood  estimator 
of  reliability  or  survival.  When  used  in  the  context  of  bootstrapping  or  factorial  simulation  the 
computation  times  can  become  lengthy  [Efron  1986,  Grier  1988]. 

3.1.13  Multidimensional  Scaling 

This is a technique for reducing dimensionality and graphically displaying a complicated data
set. Array-type floating point calculations are needed here, making it particularly suitable for DSP
applications if fast interactive display is needed.

3.1.14  Density,  Hazard,  and  Nonlinear  Estimation 

The computational complexity of the kernel method for density estimation can be reduced if
techniques such as the FFT are used [Silverman 1986]. Reducing the standard error through
cross-validation fitting adds another layer to the complexity.

3.1.15  Probability  Distribution  Function  and  Inverses 
-  and  - 

3.1.16  Random  Number  Generator 

The  above  two  categories  often  go  hand  in  hand.  For  bootstrapping  and  Monte  Carlo 
simulation,  high-quality  pseudo-random  number  generators  [Gleason  1988]  and  accurate  probability 
density  function  inverses  are  important. 
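
The link between the two categories can be illustrated with inverse-transform sampling: a uniform pseudo-random number is pushed through the inverse of a distribution function. The sketch below uses the exponential distribution because its inverse has a closed form; the function names, the `rate` parameter, and the seed are assumptions for illustration only.

```python
import math
import random


def exponential_inverse_cdf(u, rate):
    """Inverse of F(x) = 1 - exp(-rate * x) for 0 <= u < 1."""
    return -math.log(1.0 - u) / rate


def sample_exponential(n, rate, seed=0):
    """Draw n exponential variates by transforming uniform variates."""
    rng = random.Random(seed)          # reproducible pseudo-random stream
    return [exponential_inverse_cdf(rng.random(), rate) for _ in range(n)]


draws = sample_exponential(20000, rate=2.0)
mean = sum(draws) / len(draws)         # should be close to 1 / rate = 0.5
```

Each draw costs one uniform variate plus one inverse evaluation, so both the generator and the inverse need to be fast for resampling work.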

For Bayesian computations, integration in multiple dimensions is most effectively handled by
Monte Carlo techniques. In the majority of cases, the computation rate will be limited by how fast
random numbers can be produced [Berger 1985]. Therefore, it is important to speed this computation
as much as possible.

3.1.17  Mathematical  Operations 

The  following  is  a  list  of  supporting  mathematical  techniques  for  statistical  computations. 

a.  Linear  systems 

b.  Eigensystem  analysis 

c.  Basic  matrix,  vector  operations 

The  LINPACK  class  of  problems  falls  under  the  above  three  categories.  In  many  of  these 
algorithms,  accuracy  in  calculations  is  of  prime  importance. 

d.  Interpolation, approximation

e.  Integration, differentiation

Efficient DSP-based algorithms can be developed for numerical integration and for differential
equation  solution.  The  DSP  can  also  be  efficiently  used  in  Monte  Carlo  integration  schemes.  The 
numerical  integration  will  be  particularly  important  for  Bayesian  computations. 

f.  Differential  equations. 

g.  Transforms 

Markov  analysis  includes  both  discrete  and  continuous  applications.  The  discrete  case 
requires  matrix  multiplication,  whereas  the  continuous  case  requires  solution  of  differential  equations. 
Both  of  these  computations  can  be  efficiently  implemented  using  the  DSP. 

h.  Nonlinear  equations 

i.  Optimization

Optimization  will  arise  in  many  of  the  iterative  techniques,  including  projection  pursuit, 
maximum  likelihood,  and  least-absolute-deviations  regression. 

3.1.18  Exploratory  Data  Analysis. 

The objective of exploratory data analysis is to extract as much information as possible from
a  relatively  limited  data  set  and  help  the  user  gain  insight  by  presenting  the  information  graphically 
[Cleveland  1988,  Young  1989].  This  typically  involves  techniques  such  as  rotations  and  data 
smoothing. 

As  an  example,  the  DSP  is  well  suited  for  data  smoothing.  Most  of  the  needed  algorithms 
are  well  known  and  have  been  optimized  for  DSP  use.  Techniques  are  available  which  permit 
expressing interpolation splines as digital filtering algorithms [Schaffner 1981]. Since this approach
reduces the need for division, an efficient coding of a spline calculation is feasible. The high-speed
smoothing capability will permit many of the filtering operations to be performed in a real-time,
interactive environment.

3.2  Computation-Intensive  Algorithms 

In  the  last  ten  years,  the  emphasis  in  statistical  research  has  shifted  to  computationally- 
intensive  techniques  and,  in  particular,  to  the  development  of  efficient  algorithms.  These  techniques 
include  nonparametric  estimation  techniques,  such  as  bootstrapping  and  jackknifing  [Efron  1983]. 
Other  time  consuming  computation  techniques  include  simulation  experiments  with  various 
combinatorial  testing  procedures,  projection-pursuit  regression  using  Gauss-Seidel  iteration  [Thisted 
1988],  and  numerical  quadrature  for  multivariate  integrals.  Major  emphasis  has  also  been  placed  on 
the  use  of  computers  for  exploratory  data  analysis  and  using  computer  graphics  in  multivariate  data 
analysis (such as MacSpin™). The supporting computations required in these analyses include
interpolation  splines,  polynomial  evaluation,  least-squares  data  fitting,  solution  of  nonlinear  equations, 
optimization,  random  number  generation,  etc. 

The statistical analysis methods selected for a detailed investigation include statistical
techniques  used  with  bootstrapping  (such  as  correlation,  regression,  and  time-series  prediction)  and 
also  iterative  techniques  (such  as  projection  pursuit). 

3.2.1  Bootstrapping and Resampling

Bootstrapping  [Efron  1982]  is  a 
resampling  scheme  which  allows  estimation  of 
variance  and  permits  computation  of  confidence 
intervals.  Bootstrapping  makes  no  specific 
distribution  assumptions  and  can  be  likened  to 
a simulation procedure that generates data
samples  from  the  given  empirical  distribution. 

It  is  useful  in  those  situations  where  a  limited 
amount  of  data  is  available  or  in  simulation 
studies  where  it  takes  longer  to  generate  output 
from  the  simulation  program  than  to  resample 
[Lewis  1989]. 

Jackknifing and cross-validation are
related techniques used to reduce bias,
estimate variability, and calculate confidence
intervals. Confidence intervals, in particular,
have  been  shown  to  require  many  bootstrap 
samples  (>1000),  making  these  techniques  even  more  computation-intensive  [Efron  1990]. 
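
A minimal sketch of a bootstrap confidence interval for a mean is given below. It uses the simple percentile method, which is only one of several interval constructions; the function names, the data values, and the choice of 2000 resamples are assumptions made for illustration.

```python
import random


def bootstrap_means(data, n_boot, seed=0):
    """Means of n_boot resamples drawn from data with replacement."""
    rng = random.Random(seed)
    n = len(data)
    means = []
    for _ in range(n_boot):
        resample = [data[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    return means


def percentile_interval(values, alpha=0.05):
    """Simple percentile interval covering the central 1 - alpha fraction."""
    ordered = sorted(values)
    lo = ordered[int((alpha / 2) * len(ordered))]
    hi = ordered[int((1 - alpha / 2) * len(ordered)) - 1]
    return lo, hi


data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7, 3.9, 3.1]
means = bootstrap_means(data, n_boot=2000)
low, high = percentile_interval(means)
```

The work grows as the product of the number of resamples and the sample size, which is why interval estimation is far more expensive than a single point estimate.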

In  addition  to  these  techniques,  shuffling  is  often  used  for  testing  randomness  of  populations, 
while Monte Carlo simulations are often used to derive statistics from an assumed distribution.

Figure  3.1  schematically  describes  several  of  the  techniques.  Open  and  closed  circles 
represent  two  distinct  data  populations.  Shuffle  techniques  create  a  randomized  test  set  by  mixing 
the  two  populations  together.  A  Monte  Carlo  test  randomly  draws  samples  from  the  probability 
distributions  of  the  two  populations.  Bootstrapping  creates  an  artificial  sample  from  a  single 
population  by  randomly  choosing  points  with  replacement. 
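
The shuffle technique described above can be sketched as follows. The test statistic (absolute difference of means), the function name `shuffle_test`, and the sample values are assumptions chosen for illustration.

```python
import random


def shuffle_test(a, b, n_shuffles, seed=0):
    """Fraction of random relabelings whose mean difference is at least
    as large (in absolute value) as the observed one."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)                       # mix the two populations
        new_a, new_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(new_a) / len(new_a) - sum(new_b) / len(new_b))
        if diff >= observed:
            count += 1
    return count / n_shuffles


# Clearly separated populations give a small fraction; identical ones give 1.
p_far = shuffle_test([10.1, 9.8, 10.3, 10.0, 9.9], [0.2, -0.1, 0.1, 0.0, 0.3], 2000)
p_same = shuffle_test([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0], 2000)
```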

3.2.2  Projection Pursuit (PP)

Projection  pursuit  regression  (PPR)  is  a  form  of  multiple  nonlinear  regression  which  is  used 
for  constructing  a  model  for  a  response  variable  as  a  nonlinear  function  of  a  collection  of  predictors 
[Thisted  1988].  The  projection  pursuit  methods  involve  both  smoothing  and  optimization  and  can 
be  applied  to  regression  and  clustering  [Friedman  1974]. 

As  applied  to  regression,  smoothing  is  normally  accomplished  using  splines,  with  the 
optimization  step  requiring  nonlinear  Gauss-Seidel  iterations.  A  PPR  computation  may  involve  one 
complex iteration within another iteration. Each of the iterations may, in turn, involve multivariate
functions.  This  can  lead  to  very  long  computation  times. 

Since projection pursuit is mainly used for exploration of regression, clustering, and density
estimation, a very high computational accuracy is not needed. Therefore, the projection pursuit
methods appear ideal for DSP use.

Figure 3.1  Resampling techniques using random number generation.

3.2.3  Bayesian Analysis

Bayesian  decision  theory  requires  the  specification  of  a  loss  function.  In  this  case  the 
optimum decision is the one which minimizes the expectation of this loss. Computation of this
expectation  involves  obtaining  the  posterior  distribution  based  on  current  observations. 

Many of the Bayesian analysis applications require complex multivariate integrations. In low
dimensions, integration techniques such as Simpson's rule and Gaussian quadrature can be used.
Multivariate integration in high dimensions should use Monte Carlo based numerical integration
techniques [Press 1989]. It is important to note that integrating through higher dimensions results
in increasingly slow convergence [Plant 1989].
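
The contrast between the two families of methods can be sketched as follows. The one-dimensional integral uses composite Simpson's rule, and the five-dimensional integral over the unit hypercube uses plain Monte Carlo sampling. The integrands, sample counts, and function names are assumptions chosen so that the exact answers are known.

```python
import math
import random


def simpson(f, a, b, n):
    """Composite Simpson's rule on [a, b] with n (even) subintervals."""
    if n % 2:
        raise ValueError("n must be even")
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


def monte_carlo(f, dim, n_samples, seed=0):
    """Monte Carlo estimate of the integral of f over the unit hypercube."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(n_samples):
        point = [rng.random() for _ in range(dim)]
        acc += f(point)
    return acc / n_samples


one_d = simpson(math.sin, 0.0, math.pi, 100)          # exact value is 2
five_d = monte_carlo(lambda p: sum(p), 5, 50000)      # exact value is 2.5
```

A grid rule with 100 points per axis would need 100^5 evaluations in five dimensions, whereas the Monte Carlo error shrinks with the square root of the sample count regardless of dimension.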

3.2.4  Iterative  Techniques 

Iterative  techniques  play  a  major  role  in  computation-intensive  statistical  analyses.  Many 
illustrative examples were presented earlier, such as the PPR, where we found nested iterations.
Iterative techniques are routinely required when nonlinear problems in statistics are
encountered.

The two iterative techniques considered are successive over-relaxation (SOR) and iterative
matrix pseudo-inversion (MacKay's algorithm). SOR finds application in the solution of differential
equations, while MacKay's algorithm [MacKay 1981] is a variation on the iterative solution for finding
the pseudo-inverse of a matrix.
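
A minimal sketch of SOR applied to a small linear system is shown below, in its usual Gauss-Seidel-based form. The relaxation factor, tolerance, and test matrix are assumptions for illustration; MacKay's algorithm is not reproduced here.

```python
def sor_solve(a, b, omega=1.25, tol=1e-10, max_iter=10000):
    """Solve a x = b by over-relaxed Gauss-Seidel sweeps (SOR).

    a is a square matrix as a list of rows; convergence is assumed
    (for example, a symmetric positive definite with 0 < omega < 2).
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(max_iter):
        largest_change = 0.0
        for i in range(n):
            # Gauss-Seidel value for x[i], using the newest available entries.
            s = sum(a[i][j] * x[j] for j in range(n) if j != i)
            gs = (b[i] - s) / a[i][i]
            new = (1.0 - omega) * x[i] + omega * gs
            largest_change = max(largest_change, abs(new - x[i]))
            x[i] = new
        if largest_change < tol:
            break
    return x


# 4x + y = 1, x + 3y = 2 has the solution x = 1/11, y = 7/11.
solution = sor_solve([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```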

Most  of  the  iterative  matrix  inversion  routines  require  a  fairly  large  number  of  iterations 
before  converging  [Phipps  1986].  Although  they  are  slow,  they  offer  advantages  of  increased 
accuracy,  which  is  particularly  important  in  DSP  applications  where  only  single  precision  capability 
is  currently  available. 

3.3  Objectives  for  Algorithm  Development  and  Evaluation 

As  the  preceding  applications  may  require  much  computer  time,  there  is  a  need  to  handle 
these  computations  in  a  cost-effective  way.  Therefore,  a  number  of  the  above  techniques  and 
algorithms  were  selected  to  guide  the  conceptual  design  and  provide  a  basis  for  the  feasibility 
evaluation  of  the  proposed  workstation. 

We cannot expect that all of the developed algorithms will exhibit a substantial improvement
in speed. One of the sub-tasks was to identify computation bottleneck areas and to identify other
promising solutions using modified algorithms or additional hardware.

Emphasis  was  also  placed  on  commonality  and  reusability  aspects  of  the  algorithms  to  reduce 
memory  requirements  and  complexity.  Since  there  is  much  commonality  between  the  statistical 
analysis and forecasting techniques and modern signal processing, the algorithm optimization process
built upon the existing knowledge.

Accuracy of computation results will depend not only on the computation mode (single or
double precision), but also on the specific algorithm selected. Optimal results are obtained by a
careful  balance  of  algorithms  and  computation  mode.  Unfortunately,  the  complexity  of  the  algorithms 
seldom  permits  a  direct  estimation  of  computation  accuracy.  In  this  situation,  subjective  evaluation 
of  the  results  may  be  necessary. 

4  BASIC LOW-LEVEL STATISTICAL SUBROUTINES


Most  computation-intensive  statistical  algorithms  can  be  subdivided  into  a  hierarchy  of  levels. 
An  example  of  the  hierarchy  for  one  application  is  shown  in  Figure  4.1.  The  hierarchy  is  arranged 
such  that  the  amount  of  time  that  the  algorithm  spends  in  any  one  level  increases  from  top  to  bottom. 
The  lowest  level  consists  mainly  of  array  computations  that  are  performed  many  times.  These  are 
denoted  by  the  terms  BLAS  and  BSAS. 


4.1  Low-level  Algorithms 


To achieve the largest possible speedup
in any application, the lowest level algorithms
need to be optimized. We will consider this
optimization from a DSP perspective.


4.1.1  Low-level  Structure 


To  exploit  the  DSP’s  features  of 
pipelining  and  parallel  processing  efficiently,  the 
proper  matching  of  computation  algorithms  to 
the  hardware  structure  is  required.  This 
matching  is  a  very  difficult  task  to  accomplish  in 
a  higher  level  language  alone  because  these 
languages  seldom  provide  the  hardware  dependent  features. 

One  approach,  which  has  been  used  with  success  in  supercomputer  programming,  is  to  use 
a multilevel structure in developing the software modules. The lowest level of these modules is
developed to include all of the processor-dependent details required to achieve the expected high
performance. This implies that the lowest level must be programmed in a machine-dependent
language. 

An  example  will  be  used  to  illustrate  this  need.  In  a  DSP  the  most  efficient  operation  is  the 
multiply-accumulate  (MAC)  instruction  illustrated  in  Equations  1.1  and  1.2.  In  addition  to  the 
multiplication  and  addition,  this  instruction  also  provides  the  capability  to  advance  index  registers. 
The  presently  available  higher  level  languages  do  not  have  the  capability  to  express  this  operation 
in a form that could be easily optimized by the compiler. The C language [Kernighan 1978] comes
close (see Equation 4.1) but does not allow variable increments on the pointer indexing; e.g., *A++
means unary post-increment to the next element of the array, whereas the DSP is capable of larger
offsets than 1.

*A++  =  al  +  *B++  •  Eq.(4.1) 
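
The role of variable index increments can be sketched in a higher-level form. The function `strided_mac` below is a hypothetical illustration, not DSP code: the arguments `incx` and `incy` stand in for the index-register increments, so the same loop can walk along a row (stride 1) or down a column (stride equal to the row length) of a matrix stored row by row.

```python
def strided_mac(n, x, incx, y, incy, acc=0.0):
    """Accumulate x[i*incx] * y[i*incy] for i = 0..n-1.

    incx and incy play the role of the index-register increments; any
    positive stride is allowed, not only 1.
    """
    ix = iy = 0
    for _ in range(n):
        acc += x[ix] * y[iy]    # multiply-accumulate
        ix += incx              # advance the two "pointers"
        iy += incy
    return acc


# Inner product of row 0 and column 1 of a 2x2 matrix stored row by row.
m = [1.0, 2.0,
     3.0, 4.0]
value = strided_mac(2, m, 1, m[1:], 2)   # 1*2 + 2*4
```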

Thus by developing the lower level modules in a machine-dependent language, we can
guarantee that the performance at this level will not be compromised. To migrate to a different DSP,
only these lowest levels need to be changed.

Figure 4.1  Statistical algorithm hierarchy

4.1.2  Basic Statistical Analysis Subroutines - BSAS

During the development of the low-level algorithms, emphasis was placed on using the most
efficient parallel operations, such as the MAC. Although many of the statistical algorithms for DSP
applications  will  have  to  be  modified  or  developed  anew,  careful  examination  of  the  digital  signal 
processing  algorithms  for  possible  adaptation  to  statistical  problems  will  save  development  time. 

Since  statistical  procedures  involve  matrix  multiplications,  sum  of  squares,  and  other  similar 
floating  point  operations,  we  can  expect  that  the  use  of  DSP’s  will  accelerate  these  computations. 
As  an  example,  several  basic  low-level  building  blocks  can  be  written  in  pseudo-code  to  highlight  their 
DSP  use  (compare  to  Equation  (1.1)): 

o  Summation of series:

s[0] = 0;  s[k] = s[k-1] + x[k];  1 <= k <= n

o  Summation of squares:

s[0] = 0;  s[k] = s[k-1] + x[k]^2;  1 <= k <= n
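
The two recurrences translate directly into a single loop. The sketch below, with the assumed name `mean_and_variance`, accumulates both sums in one pass and derives the sample mean and variance from them.

```python
def mean_and_variance(x):
    """Sample mean and variance from the two running sums."""
    n = len(x)
    s = 0.0      # summation of series
    ss = 0.0     # summation of squares
    for xk in x:
        s += xk
        ss += xk * xk
    mean = s / n
    variance = (ss - n * mean * mean) / (n - 1)
    return mean, variance


m, v = mean_and_variance([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

This one-pass form subtracts two large, nearly equal quantities when the mean is large relative to the spread, so it can lose accuracy in single precision; a two-pass or updating formula may be preferable in that case.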

By definition, the basic building blocks will be those level 1 operations that are common to
most  statistical  analysis  applications.  This  includes  basic  linear  algebra  operations,  as  well  as 
operations  that  are  unique  to  statistical  analyses.  The  linear  algebra  operations  consist  of  vector  and 
matrix  manipulations,  polynomial  evaluation,  and  data  transformation.  The  statistical  operations 
include  computation  of  mean  and  variance. 

The linear algebra subroutines used in this study were modeled after the Basic Linear Algebra
Subprograms (BLAS), which consist of the commonly used vector operations in linear algebra (see
Figure 4.2). A detailed description of these subroutines is presented in the LINPACK user's guide
[Dongarra 1979].


Routine   Operation

SCOPY     Copies array X onto Y.
SSWAP     Swaps array X with array Y.
SSCAL     Scales array X by floating point value A.
SAXPY     Multiplies array X by a constant and adds to Y; result stored in Y.
SDOT      Inner product of X and Y.
SNRM2     Norm of X.
SASUM     Absolute value norm of X.
ISAMAX    Returns index of maximum (i.e. Mode).
SROTG     Converts vector to Givens sine and cosine projections.
SROT      Givens rotation.

Figure  4.2  BLAS  routines  (single-precision). 


The Basic Statistical Analysis Subroutines (BSAS) were designed using a similar approach (see
Figure 4.3). The identification and optimization of these basic building blocks is necessary to achieve
high computing speed, as many recent studies have shown [Harrod 1987, Bates 1987].

4.1.3  Implementation Approach

A  different  statistical  algorithm  selection  and  optimization  strategy  is  required  for  the  DSP- 
based  workstation  given  that  the  processor  can  perform  a  multiply-accumulate  (MAC)  operation  in 
one instruction cycle time. Most of the previous optimization criteria were based on minimization
of multiplications at the expense of extra additions. The new optimization criterion instead favors
MAC operations and concurrent address updates to minimize the number of instruction cycles needed to
perform a specific task.

The algorithm selection should consider the available DSP operations, loop control, and
automatic memory pointer indexing. In Appendix A, we show how building blocks such as summation
of series, summation of squares, polynomial evaluation, moving averages, and exponential smoothing
are related to the available DSP instructions and are used to build the BLAS and BSAS library.

The basic building blocks may be used to perform compound operations. For example, matrix
multiplication is a compound operation that applies the inner product, one of the basic building
blocks, inside nested loops.
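As a rough illustration of this composition, the sketch below builds a matrix product out of an inner-product routine. It is written in Python for readability and is not the DSP32 code of Appendix A; the function names `dot` and `matmult` and the list-of-rows matrix layout are assumptions made for this example only.

```python
def dot(x, y):
    """Inner product of two equal-length vectors.

    Each loop iteration is one multiply-accumulate (MAC) step.
    """
    acc = 0.0
    for xi, yi in zip(x, y):
        acc += xi * yi
    return acc


def matmult(a, b):
    """Multiply two matrices stored as lists of rows.

    The inner product is the building block; the nested loops over the
    rows of a and the columns of b form the compound operation.
    """
    b_cols = list(zip(*b))  # columns of b
    return [[dot(row, col) for col in b_cols] for row in a]


if __name__ == "__main__":
    print(matmult([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]))
```

On a DSP the body of `dot` would map onto a single MAC instruction with automatic pointer updates, so the cost of the compound operation is governed almost entirely by the number of inner-loop iterations.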

More recently, extensions to the BLAS have been proposed. These include the BLAS2
[Dongarra 1984] extensions:

o  Matrix x Vector Update      $y \leftarrow y \pm Ax$
o  Vector x Matrix Update      $x^T \leftarrow x^T \pm y^T A$
o  Rank 1 Update               $A \leftarrow A \pm y x^T$
o  Triangular Solver           $x \leftarrow T^{-1} x$

An even more recent addition, BLAS3 [Harrod 1987], adds the following operations:

o  Rank k Update               $A \leftarrow A \pm BC$
o  Matrix Transpose x Matrix   $A \leftarrow A \pm E^T C$

where $A \in R^{m \times n}$, $B \in R^{m \times k}$, $C \in R^{k \times n}$, $E \in R^{k \times m}$,
$T \in R^{n \times n}$, $x \in R^n$, $y \in R^m$.

While  the  original  version  contained  only  the  basic  linear  operations,  the  later  additions 
extended  these  to  vector  x  matrix  and  matrix  x  matrix  operations.  In  each  of  these  cases  a 
substantial  performance  improvement  was  noted.  All  of  these  improvements  were  due  to  the  special 
hardware  features  available  in  the  processor.  The  greatest  improvements,  however,  were  achieved 
at  the  lowest  level. 

4.2  Grouping  of  Low-level  Routines 


The  lowest  level  routines  can  be  grouped  according  to  their  application.  Where  similarities 
exist  to  operations  in  the  library  available  for  our  test  system  (AT&T  DSP32  hardware  and  software), 
these are duly noted.

ABSDEV      Returns the sum of absolute deviations of the elements of array X from value A.
ADDCPY      Adds a scalar to the elements of an array before copying to another.
ADDSCAL     Scales a vector and adds a translation.
ADDSCALCPY  Scales and translates a vector before copying to another.
ADDVEC      Adds two floating point vectors.
CDF         Computes the cumulative distribution function of an array.
CENTER      Adds a floating point scalar to the elements of an array.
CSUM        Calculates the cumulative sum of an array.
CSUMSQ      Calculates the cumulative sum of a product of a value squared and another value.
DIST        Calculates the square of the Euclidean distance between two vectors.
EXPSM       Filters an input array using an exponential smoothing algorithm.
FILL        Creates an array of floating point values based on a starting value and a step size.
FLOATA      Converts an array of integer values to an array of floating point numbers.
HEAP        Does an in-place heap sort in ascending order on the floating point array SX.
HISTOG      Bins values according to their floating point magnitude (floating point).
HORN        Evaluates a polynomial expression according to Horner's algorithm.
INDEX       Returns an indexing array in ascending order.
INTA        Converts an array of floating point values to an array of integer numbers.
LIMIT       Clamps the input value to the upper or lower limit if x does not fall within its range.
MAC         Performs a multiply-accumulate on two vectors.
MATMULT     Multiplies two matrices together and returns the result.
MATMULT1    Multiplies a matrix with a transpose of a matrix.
MATMULT2    Multiplies the transpose of a matrix by a matrix.
MATTMAT     Multiplies a matrix by its transpose.
MATMATT     Multiplies a matrix by its transpose in the reverse order.
MATVEC      Multiplies a matrix by a vector.
MAXA        Finds the maximum value in an array.
MAXIND      Finds the index of the maximum value in an array.
MEAN        Calculates the average value of an array by using a two pass algorithm.
MEDIAN      Returns the midpoint index of an array.
MINA        Finds the minimum value of an array.
MININD      Finds the index of the minimum value in an array.
MINMAX      Finds the minimum and maximum values within a floating point array.
MOMENT      Calculates the third and fourth moments of a centered array.
PROD        Returns the cumulative product of an array.
QABS        Returns the absolute value of a single argument.
QABSA       Converts all elements in an array to their absolute values.
QMAX        Returns the maximum of two floating point numbers.
QMIN        Returns the minimum of two floating point numbers.
RANK        Sorts the indices according to their rank.
SCALCPY     Multiplies the input array by a floating point scalar and then copies to another.
SIGN        Transfers the sign of X to Y and returns it.
SIGNA       Transfers the sign of values in the X array to the Y array and then copies to the output array.
SSQR        Calculates the sum of squares of a vector's components.
SUBVEC      Subtracts two floating point vectors.
SUMUNTIL    Sums an array of numbers, returning the index where the sum exceeds the set value.
TRANSP      Returns the transpose of a matrix.
UPDATPROD   Accumulates the product of elements in the X and Z arrays in the Y array.
UPDATSQR    Accumulates the square of the elements in the X array in the Y array.
UPDATSUM    Accumulates the elements in the X array in the Y array.
VECMAT      Multiplies a vector by a matrix.
WDOT        Performs the weighted dot (inner) product between two vectors.

Figure 4.3 BSAS routines.

4.2.1 Vector Operations


Most  of  the  standard  vector  operations  are  easy  to  compute  in  the  DSP.  The 
autoincrementing  capability  permits  very  efficient  vector  addition,  subtraction,  and  multiplication  (see 
e.g.  SAXPY  and  ADDVEC  in  Appendix  A).  The  same  features  also  permit  the  development  of 
efficient  operations  on  matrices  and  other  highly  regular  data  structures. 
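The following sketch shows, in Python rather than DSP32 assembly, the element-by-element pattern behind SAXPY and ADDVEC. It is only an illustration under assumed signatures (the argument order and in-place update are choices made here, not taken from Appendix A). The single index that walks both arrays plays the role of the autoincrementing pointers.

```python
def saxpy(a, x, y):
    """Compute y <- a*x + y in place and return y.

    One multiply-accumulate per element; the shared index i stands in
    for the autoincremented memory pointers of the DSP.
    """
    for i in range(len(x)):
        y[i] += a * x[i]
    return y


def addvec(x, y):
    """Return the element-wise sum of two equal-length vectors."""
    return [xi + yi for xi, yi in zip(x, y)]


if __name__ == "__main__":
    print(saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
    print(addvec([1.0, 2.0], [3.0, 4.0]))
```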

4.2.2  Vector/Matrix  and  Matrix  Operations 

Matrix  operations  can  be  considered  two  dimensional  extensions  of  the  basic  vector 
operations.  However,  the  matrix  algorithms  are  more  complex,  mainly  due  to  the  added  indexing 
computation  requirements.  Since  these  operations  are  often  found  in  the  inner  iterative  loops  of 
statistical  computations,  optimization  of  these  operations  is  highly  desirable. 

The low level operations are the basic matrix operations such as addition and multiplication.
One basic algorithm included in the AT&T DSP support library is MATMUL, which multiplies two
matrices and places the result in a third. To work with the available processor, slight modifications
were made to that code (see MATMULT in Appendix A).

Higher  level  operations  include  the  more  complex  matrix  operations,  such  as  matrix  inversion. 
The  matrix  inversion  routine  included  with  the  AT&T  library  (MATINV)  uses  Gaussian  elimination, 
which  is  not  as  flexible  a  technique  as  others  available  (e.g.  singular  value  decomposition).  In  some 
cases,  due  to  the  limited  accuracy  of  the  DSP,  iterative  matrix  inversion  techniques  may  be  needed. 

A possible route to further speed improvement would be to develop a more
complex arithmetic unit in the processor, capable of handling higher dimension problems. In
particular, some of the more recent graphics processors have highly efficient architectures for
handling two-dimensional graphics displays.

4.2.3 Polynomial Evaluation

In its simplest form, polynomial evaluation can be represented as Horner's algorithm (see
HORN in Appendix A):

P[0] = a[n];  P[k] = a[n-k] + x P[k-1],  1 <= k <= n

Most  of  the  DSP-based  polynomial  evaluation  algorithms  use  variations  of  Horner’s  method. 
The  development  of  DSP  algorithms  for  polynomial  evaluation  presents  some  unique  problems.  Due 
to  the  pipelining  effects  in  the  DSP,  optimization  of  the  polynomial  evaluation  algorithm  requires 
folding  of  some  of  the  operations.  Using  this  approach,  execution  time  can  be  reduced  by  as  much 
as  a  factor  of  two. 
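A minimal Python sketch of Horner's recurrence is given here for orientation. It shows only the plain recurrence, one multiply-accumulate per coefficient, and does not attempt the pipeline folding described above. The function name `horn` and the coefficient ordering (constant term first) are assumptions of this example.

```python
def horn(coeffs, x):
    """Evaluate a[0] + a[1]*x + ... + a[n]*x**n by Horner's rule.

    coeffs is [a[0], a[1], ..., a[n]].
    """
    p = coeffs[-1]                    # P[0] = a[n]
    for a in reversed(coeffs[:-1]):   # a[n-1], a[n-2], ..., a[0]
        p = a + x * p                 # P[k] = a[n-k] + x * P[k-1]
    return p


if __name__ == "__main__":
    # 1 + 2x + 3x^2 evaluated at x = 2
    print(horn([1.0, 2.0, 3.0], 2.0))
```

Each step depends on the result of the previous one, which is why a pipelined processor cannot overlap the steps without restructuring the loop.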

An efficient subroutine for finding coefficients of the product of polynomials is also possible.
The same approach can be extended to complex polynomial evaluation. Most of the polynomial root
finding algorithms are iterative, and the single precision limitation of the present DSP's may limit
general application. The same comments apply to finding the coefficients of a reciprocal of an array
and the coefficients of a polynomial from roots.

4.2.4 Random Number Routines


Two pseudo-random number generators are supplied with the AT&T DSP library. The
generator for finding a uniform number is adequate, but has a short cycle length (see Appendix B).
However,  the  generator  for  obtaining  a  variate  from  a  normal  population  is  based  on  the  crude 
technique  of  summing  uniform  variates.  Better  techniques  such  as  the  Box-Muller  method  can  be 
adapted  for  this  [Bratley  1987].  Other  random  number  distributions  needed  for  statistical  applications 
include  exponential,  Poisson,  binomial,  geometric,  gamma,  and  beta. 
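To make the contrast concrete, the sketch below places a sum-of-uniforms normal generator next to the Box-Muller transform. It is an illustration only: the AT&T library routine is not reproduced here, the choice of twelve uniforms is an assumption (the text says only that uniform variates are summed), and Python's `random.Random` stands in for the DSP uniform generator.

```python
import math
import random


def normal_sum12(rng):
    """Approximate standard normal variate from a sum of 12 uniforms.

    The sum of 12 U(0,1) variates has mean 6 and variance 1, so
    subtracting 6 gives mean 0 and variance 1. The result can never
    fall outside [-6, 6], so the tails are truncated.
    """
    return sum(rng.random() for _ in range(12)) - 6.0


def normal_box_muller(rng):
    """Return two independent standard normal variates (Box-Muller)."""
    u1 = 1.0 - rng.random()           # in (0, 1], so log(u1) is defined
    u2 = rng.random()
    r = math.sqrt(-2.0 * math.log(u1))
    theta = 2.0 * math.pi * u2
    return r * math.cos(theta), r * math.sin(theta)


if __name__ == "__main__":
    rng = random.Random(1)
    print(normal_sum12(rng), normal_box_muller(rng))
```

Box-Muller needs a logarithm, a square root, and two trigonometric evaluations per pair, all of which are software routines on a third-generation DSP, so the better tail behavior comes at some cost in cycles.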

Since  these  computations  are  modular,  the  use  of  a  dedicated  DSP  as  a  random  number 
generator  could  be  envisioned  for  many  simulation  or  sampling  problems. 

4.2.5  Signal  Processing  and  Filtering 

The DSP was originally designed for signal processing and filtering applications. Spectral
analysis and related techniques often use Fourier transforms, and the DSP's themselves are often
optimized specifically for FFT applications. Many of the current chips perform the multiplication of a
sum and difference (the "butterfly" operation in the FFT) in a single instruction.

Optimum  DSP  FFT  algorithms  are  available.  The  DSP32C  can  compute  a  4096-point, 
complex  FFT  in  20.4  ms  [AT&T  1988].  Typical  optimized  FFT  code  for  a  20  MHz  386-type  PC  will 
take  50  times  as  long  [MATLAB™  Version  3.5]*. 

In  addition  to  the  FFT  algorithms,  there  are  other  filtering  algorithms  well  suited  to  the  DSP. 

Exponential smoothing. An example of DSP coding for an exponential smoothing algorithm
is shown below (see EXPSM in Appendix A):

o  F[t+1] = a x[t] + (1-a) F[t]

Moving averages. The use of moving averages is another example of filtering. An illustration
of a moving average (3x3 case) is shown below:

o  M[t] = a[-2] x[t-2] + a[-1] x[t-1] + a[0] x[t] + a[1] x[t+1] + a[2] x[t+2]

Several  of  these  filtering  operations  are  available  in  the  AT&T  library. 
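The two filters can be sketched in a few lines of Python. This is an illustrative reading of the recurrences in this section, not the EXPSM code of Appendix A; the function names, the explicit starting forecast `f0`, and the decision to return only positions with a full window are assumptions of the example.

```python
def expsm(x, alpha, f0):
    """Exponential smoothing: F[t+1] = alpha*x[t] + (1 - alpha)*F[t].

    Returns the forecasts F[1], ..., F[len(x)], starting from F[0] = f0.
    """
    f = f0
    out = []
    for xt in x:
        f = alpha * xt + (1.0 - alpha) * f
        out.append(f)
    return out


def moving_average(x, weights):
    """Centered weighted moving average with an odd number of weights.

    Only positions where the whole window fits inside x are returned.
    """
    h = len(weights) // 2
    out = []
    for t in range(h, len(x) - h):
        acc = 0.0
        for j, w in enumerate(weights):
            acc += w * x[t - h + j]   # one multiply-accumulate per weight
        out.append(acc)
    return out


if __name__ == "__main__":
    print(expsm([1.0, 1.0], 0.5, 0.0))
    print(moving_average([1.0, 2.0, 3.0], [0.25, 0.5, 0.25]))
```

For a 3x3 moving average (a 3-term average applied twice), the five weights would be 1/9, 2/9, 3/9, 2/9, 1/9. Both inner loops consist purely of multiply-accumulate steps, which is why these filters suit the DSP.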

4.2.6  Statistical  Operations 

Sum  of  squares  and  cumulative  sum  are  examples  of  basic  statistical  operations  (see  SSQR, 
CSUM,  and  MEAN  in  Appendix  A).  These  operations  can  be  very  efficiently  achieved  in  a  DSP. 
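As an illustration, the sketch below gives Python versions of a sum of squares, a cumulative sum, and a two-pass mean. The routines in Appendix A are not reproduced; in particular, the form of the second pass used here (correcting the first-pass mean by the average residual) is an assumption about what a two-pass mean might look like, and `variance` is a hypothetical helper added to show the building blocks combined.

```python
def ssqr(x):
    """Sum of squares of the elements of x (one MAC per element)."""
    acc = 0.0
    for xi in x:
        acc += xi * xi
    return acc


def csum(x):
    """Cumulative (running) sum of x."""
    out = []
    acc = 0.0
    for xi in x:
        acc += xi
        out.append(acc)
    return out


def mean_two_pass(x):
    """Mean computed in two passes to limit round-off error."""
    n = len(x)
    m = sum(x) / n                          # pass 1: provisional mean
    return m + sum(xi - m for xi in x) / n  # pass 2: add mean residual


def variance(x):
    """Sample variance built from mean_two_pass and ssqr."""
    m = mean_two_pass(x)
    return ssqr([xi - m for xi in x]) / (len(x) - 1)


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0]
    print(ssqr(data), csum(data), mean_two_pass(data), variance(data))
```

Centering the data before squaring keeps the accumulated terms small, which matters when the accumulator has limited precision.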


* Stratified sampling FFT computations are often used to reduce spurious peaks and will take longer to perform
[Kay 1988].

4.2.7 Math Operations

Mathematical  operations  other  than  addition  and  multiplication  cannot  be  performed  in  a 
single  instruction  cycle  and  require  a  series  of  elementary  operations.  However,  many  of  the  basic 
mathematical  operations,  such  as  square  root  and  absolute  value  (see  QABS  in  Appendix  A),  can  be 
optimized for speed within the DSP. A tradeoff in accuracy can often be made to obtain fewer
instructions.*

4.2.8  Sort  Operations 

Although the DSP does not support fast comparison operations, sorting of data sets can be
performed efficiently within the DSP (see HEAP in Appendix A). As these operations do not require
multiplications or additions, the speed advantage will typically not be as large as for other operations.

4.2.9  Scaling  Operations 

Scaling  of  data  sets  and  finding  extrema  can  be  handled  efficiently  within  the  DSP  (see  for 
example  SCALCPY,  SSCAL,  and  MINMAX  in  Appendix  A).  Some  of  these  operations  are  very 
similar  to  the  vector  operations  mentioned  earlier  but  are  used  more  often  in  the  context  of  graphing 
than  linear  algebra. 

4.3  Intermediate  Level  Algorithms 

The  intermediate  level  (level  2)  includes  algorithms  for  the  special  functions  needed  in 
statistical  computations,  such  as  covariance,  correlation,  multivariable  regression  analysis,  maximum 
likelihood  estimation,  spectral  analysis,  smoothing,  adaptive  filtering,  and  forecasting.  These 
operations  routinely  use  the  basic  building  blocks  (BLAS  and  BSAS)  and  therefore  can  be  optimized 
for  DSP  use  by  substituting  the  low-level  routines  where  necessary. 

4.4  Prototyped  Examples 

Several  of  the  computation-intensive  high-level  statistical  algorithms  were  coded  for  use  on 
the  DSP  during  this  effort.  We  do  not  intend  to  give  complete  descriptions  of  the  algorithms  but  to 
demonstrate  how  the  low-level  routines  are  inserted  and  what  changes  need  to  be  made  in  the  overall 
structure  of  the  code. 

The  Phase  I  statistics  workstation  feasibility  effort  examined  several  prototyping  applications. 
These  involved  a  representative  sample  from  several  of  the  areas  of  computation-intensive  statistics. 
The samples chosen for evaluation were:

1.  Correlation coefficient (bootstrapped).
2.  Multiple linear regression using SVD (bootstrapped).
3.  Autoregressive model (bootstrapped).
4.  1D and 2D projection pursuit.
5.  Markov modeling.
6.  Iterative techniques (MacKay's and SOR).
7.  Density estimation.
8.  Survival analysis (Kaplan-Meier estimate).
9.  K-means clustering (bootstrapped).
10. Bayesian bootstrap (integration by Simpson's rule).
11. Neural networks for discrimination.
12. Euclidean distance measurement.
13. Stochastic simulation.

* For example, there are several square root functions available in the AT&T library; these include sqrtf() and
sqrtq(), where the extensions 'f' and 'q' indicate fast (accurate) and quick (less accurate).

Table 4.1 shows which low-level routines were used in the various algorithms. Appendix A
gives the descriptions of the BLAS/BSAS routines for these applications. Each one of the algorithms
investigated  requires  significant  computation  both  in  the  number  of  arithmetic  steps  and  the  number 
of  trials  in  a  given  simulated  sample.  In  addition,  good  pseudo-random  number  generation  plays  an 
important  part  in  the  process  (see  Appendix  B). 

As  a  secondary  issue,  less  work  was  focussed  on  signal  processing  and  filtering  applications 
(such  as  the  FFT,  correlation,  convolution,  moving  average,  etc.),  as  these  are  well  known  to  be 
optimum  applications  for  DSP  work.  Similarly,  less  work  was  done  on  creating  distributions,  error 
checking,  etc.  which  would  be  needed  for  a  commercial  version.  In  the  case  of  graphics,  the 
algorithms  could  be  similarly  evaluated  and  optimized. 

We did not consider the conventional (non computation-intensive) applications simply because
the current power of PC's is more than sufficient to handle these. In the cases where the initial
motivation for developing the statistic was to minimize the number of computations (in the days
before affordable computers),^ no attempt was made to prototype for the DSP.

Before  going  into  more  of  the  details  of  the  high-level  algorithms,  we  describe  the  approach 
we  have  taken  for  workstation  design  (section  5)  and  algorithm  development  (section  6).  In  section 
8  we  report  on  the  performance. 


^ [Box 1978] pointed out that some of the nonparametric tests, such as the Wilcoxon test, were developed
specifically for hand calculation.

Table 4.1 BSAS/BLAS usage chart.

5 STATISTICS WORKSTATION DESIGN


The  design  of  a  statistical  workstation  must  meet  the  needs  of  the  users  and  provide  a  cost 
effective  solution  to  their  problems.  As  we  noted  earlier,  using  commercially  available  digital  signal 
processors  to  do  the  bulk  of  the  statistical  computation  can  provide  a  potential  alternative  and  lower 
cost  solution.  The  presently  available  third-generation  DSP’s  include  many  features,  such  as  floating 
point  capability,  high  processing  speeds  (20-40  million  floating  point  operations  per  second 
(MFLOPS)),  a  simple  interface,  and  the  capability  to  support  multiprocessor  operation.  These 
features  make  their  application  to  computation-intensive  statistical  computations  particularly 
attractive. 

Figure 5.1 shows the speed in floating point operations per second versus cost for various
supercomputers [Erisman 1988] and the proposed DSP-based statistics workstation. Straight lines
show the regions of equivalent ratio of computing speed per dollar invested. The advantage of a
low-priced DSP workstation is apparent in an application where cost, immediate access, and an
interactive environment are more important than extremely fast execution. For example, 15
minutes of supercomputer time may translate to 25 hours on a dedicated workstation, but the
lower cost and freedom of access to a workstation would make it more favorable.

Figure 5.1 Price/performance ratio for different computers and the proposed statistics workstation.

Thus, our design philosophy emphasizes the development of DSP-based computation
algorithms for the basic statistical operations to achieve a substantial improvement in speed.
Since a similar approach had been used earlier in developing new algorithms for solving linear
algebra problems on supercomputers, i.e. BLAS, it provided a good foundation. Furthermore, even
though the statistical algorithms were developed for a specific DSP, future improvements in the
DSP capabilities will not invalidate most of the algorithms developed because of the basic
operational similarities between DSP's.

As  an  alternative,  an  even  higher  processing  speed  could  be  attained  if  the  statistical 
computations  could  be  performed  using  a  custom-designed  VLSI  processor.  However,  the  design  of 
such  a  statistical  processor  would  involve  high  risk  and  cost.  Typically,  the  development  cost  of  a 
high-performance  specialized  processor  has  been  in  the  $5-10  million  range  and  would  be  prohibitive 
for  the  planned  application. 

Yet another approach would be to use the currently available programmable devices, such as
the high-speed bit-slice processors and the high-speed numeric processors. These devices are
currently used to build very high speed digital processors. A major disadvantage in using this
approach would be the high cost associated with developing support software for these devices.

5.1  Key  DSP-based  Statistics  Workstation  Design  Objectives 

The long-term goal is to develop a flexible, interactive, high-speed, and low cost statistics
workstation capable of solving a wide range of complex statistical problems. The workstation
architectural design objective is to provide mainframe or low-end supercomputer speed in a desktop
system. This workstation will support scientific quality graphics and will provide a user-friendly
interface. The goal was set to be able to perform statistical computations at least 10 times faster than
on a minicomputer or 100 times faster than on a high-performance PC. The estimated workstation
hardware cost goal was set to be below $10K. Our initial projections showed that this goal could be
reached with the proper hardware and algorithm optimization.

To  reach  this  goal  will  require: 

5.1.1  High  Speed  Floating  Point  Computation 

Although today's supercomputers provide high speed, they are expensive to use and
difficult to access. When high speed is required in an interactive environment, use of the
supercomputer must be ruled out because of its remote location. High speed in the statistics
workstation is achieved by selecting a fast DSP and optimizing all of the key algorithms. As
mentioned earlier, this optimization will be performed with respect to MAC-like instructions. The
best  candidates  for  processors  are  those  that  are  inherently  parallel  and  use  pipelining  techniques. 
Increasing  only  processor  clock  speed  will  result  in  limited  improvement.  In  addition,  to  reach  the 
desired  performance  level,  memory  speed  must  be  matched  to  the  DSP  speed,  particularly  for 
frequent  accesses.  If  very  large  data  arrays  are  needed,  then  dynamic  memory  may  provide  a  more 
efficient  approach  at  a  slight  reduction  in  speed. 

5.1.2  Low  Cost  Components 

Low  cost  can  be  achieved  only  by  using  low-cost  commercial  parts  that  are  widely  used,  easy 
to  interface,  and  are  reliable.  These  components  include  commercial  DSP  devices,  DSP  boards  and 
widely  available  high-speed  memory.  Due  to  their  abundance  in  communications  systems,  many  of 
the  DSP  chips  cost  less  than  the  currently  available  math  coprocessors.  As  a  result,  the  low  cost  may 
open  up  new  applications  that  are  not  currently  cost-effective  to  perform  using  supercomputers. 

5.1.3 Compact Design

The  compact  design  constraint  means  that  all  of  the  needed  statistics  workstation  hardware 
should  be  provided  on  plug-in  boards  that  can  be  easily  placed  in  conventional  PCs. 

5.1.4  Operational  Versatility 

The proposed statistics workstation design does not disturb the basic functions of the personal
computer. It will still be capable of running all of the standard applications programs, such as
wordprocessing and spreadsheet processing, in addition to the new capabilities. This approach will
not only reduce user cost but will also simplify transfer of data between the workstation programs and
other user programs. These capabilities can be easily incorporated if the basic architecture of the PC
is not changed. To extend the range of applicability, improved access to the standard statistical
programs will be provided in the future. This access will be via extended data translation and
export/import programs.

Since  the  DSP  can  operate  independently,  parallel  operation  between  the  host  and  DSP  can 
be  achieved,  thus  freeing  the  host  for  other  tasks  such  as  disk  access,  etc.  In  this  mode,  the  DSP 
operation  is  quite  similar  to  a  remote  batch  operation.  By  not  requiring  a  separate  computer  for 
these  tasks,  user  costs  will  be  lowered. 

5.1.5  User  Programmability 

Since  the  statistics  workstation  must  be  capable  of  supporting  a  relatively  wide  range  of 
problems,  a  comprehensive  library  of  statistical  routines  must  be  provided. 

The  capability  to  accept  user  developed  extensions  must  be  made  available.  User 
programming  may  range  from  macros  to  the  incorporation  of  user-developed  statistical  routines.  To 
provide  this  capability,  the  workstation  interface  must  be  clearly  defined  (open  interface  specification) 
and  the  necessary  software  utilities  provided. 

5.2  Proposed  System  Architecture  and  Configuration 

This section describes the proposed statistics workstation system architecture and
configuration. Figure 5.2 shows an overview of the host-DSP system.

Figure 5.2 DSP-based statistical workstation system architecture.

An alternate approach to host-DSP workstation design would be to incorporate all of the
processing in the DSP, without using a separate host. There are, however, several disadvantages
to this approach, such as the development of a new operating system, lack of file storage support,
and the design of a new user interface, among others, making this approach impractical.


5.2.1  Host  Microprocessor 

The statistics workstation design can use any high-performance PC (286, 386, 486, Macintosh,
NeXT*) system as a host. There are few speed demands on the host processor. However, a higher
speed processor is preferred because it can provide faster data handling, including downloading
programs. A faster processor will also allow non-DSP tasks to be completed faster and thus improve
the overall performance.

Many of the non floating-point computations, such as integer, string, and logic, can be more
efficiently performed in the host at a lower cost. This approach will reduce DSP loading and improve
the overall performance.

* The NeXT computer comes equipped with a Motorola 56001 DSP as a peripheral processor. Unfortunately,
it is an integer and not floating-point DSP unit, thus limiting its applicability.

5.2.2  DSP 

The  DSP  chip  is  the  key  element  of  the  statistics  workstation.  Therefore  its  selection  is 
critical  to  the  performance  level  achieved.  A  key  factor  entering  in  the  selection  is  the  architecture 
of  the  DSP,  which  in  turn  will  directly  affect  the  operational  speed. 

The majority of the earliest DSP's used an integer format to gain the needed speed
improvement. Statistical applications, however, routinely require floating point capability, which
narrows the range of suitable candidates. Of the presently available floating point DSP's (third
generation), the suitable candidates include the AT&T DSP32 and DSP32C, TI 320C30, NEC 77230, and
Motorola 96002. Detailed descriptions of the DSP devices, which have significant differences in
architecture, are provided in Appendix D and [Hart 1989]. In addition, a recently available processor,
the Intel 860, also supports some DSP operations.^

As a limitation, the third generation DSP's cannot easily support double precision (64 bit)
computations (except for the 860). These processors operate in a single precision mode (32 bit) and
use 40 bit accumulators to achieve additional accuracy and reduce round-off errors.* This
enhanced single precision accuracy is acceptable for many statistical computations.
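The effect of a wider accumulator can be illustrated with a small Python experiment. This is a sketch under stated assumptions, not a model of the DSP32: IEEE single precision (emulated with the standard `struct` module) stands in for the DSP's 32 bit format, and Python's double precision float stands in for the 40 bit accumulator. The helper names are hypothetical.

```python
import struct


def to_f32(v):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]


def sum_narrow(x):
    """Sum with the running total rounded to 32 bits after every add."""
    acc = 0.0
    for xi in x:
        acc = to_f32(acc + to_f32(xi))
    return acc


def sum_wide(x):
    """Sum 32 bit inputs in a wider accumulator; round only at the end."""
    acc = 0.0
    for xi in x:
        acc += to_f32(xi)
    return to_f32(acc)


if __name__ == "__main__":
    data = [0.1] * 100000
    print("narrow accumulator:", sum_narrow(data))
    print("wide accumulator:  ", sum_wide(data))
```

With the narrow accumulator, a rounding error is committed on every addition and the errors compound as the total grows; with the wide accumulator, the only significant rounding happens once, when the final result is stored.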

We can expect that the fourth generation DSP devices will overcome the earlier limitations
and support even more complex operations, such as division and square root. As of now, these
operations are done in software.

DSP  type.  Of  the  available  DSP  devices,  the  best  choices  are  the  AT&T  digital  signal 
processors  DSP32  and  DSP32C.  The  DSP32  is  a  relatively  low  cost  commercial  device,  widely 
available,  and  has  excellent  utility  software  support.  However,  the  DSP32  address  space  presents  a 
limitation  to  problem  size,  because  the  memory  space  is  limited  to  somewhat  less  than  64K  bytes,  due 
to  the  16-bit  address  bus  and  internal  architecture. 

The  more  advanced  DSP32C  has  several  new  instructions  and  a  larger  address  space  because 
it  uses  a  24-bit  address  bus.  Because  of  its  speed  advantages,  the  DSP32C  is  particularly  suitable  for 
a  full-size  statistics  workstation. 

DSP architecture. Although the individual DSP's differ, at the higher level there is some
commonality. Thus, every DSP contains two types of processors (see Figure 5.3). One is an integer
processor (CAU - control arithmetic unit), mainly used for address and offset computations. The
other type is a floating point processor (DAU - data arithmetic unit) used for all floating
computations. The floating point processor consists of two separate parallel units: a floating
point adder and a floating point multiplier. Not only do the integer and the floating point
processors operate in parallel, but so do the adder and multiplier. These capabilities are achieved
by using considerable pipelining.

^ It is interesting to note that the Intel 860 has been marketed more as a high-speed processor than a DSP.
Similarly, Motorola advertisements refer to the 96002 as a multi-media processor.

* The Motorola 96002 supports extended single precision floating point computations.

Interconnection and communications. Interconnection of the DSP board is via the
host 8, 16, or 32 bit bus. This bus supports data transfer and supplies power to the board. The host
bus will be used to download DSP programs and data, upload results, and obtain DSP status
information. In addition to the PC bus, a separate serial port provides the capability to connect
multiple DSP's and to access external data.

Communication  between  the  host  and  the  DSP  is  accomplished  via  signal  flags.  Since  the 
DSP  has  its  own  clock  and  can  operate  in  an  asynchronous  mode,  task  synchronization  problems  are 
greatly  reduced,  as  is  the  need  for  interrupts. 

5.2.3 Memory

Global data is data that is used by both the host processor and the DSP. Since the DSP does
not have direct access to the host memory, but only to its own memory (aside from using dual access
memory), data must be transferred over the host bus, a relatively time consuming operation
(approximately 1 MB/s). Thus, global data use should be minimized whenever possible. The use of
fewer, but larger data blocks results in a more efficient operation through the block transfer mode.
Furthermore, when considering the data transfer, we must remember that the DSP may use a
different data format for the floating point numbers. For example, the AT&T DSP32 uses a non-IEEE
floating point format. This requires floating point conversion whenever data is moved between
the microprocessor and the DSP. This limitation will not last, however, as some of the next-generation
DSP designs, such as the Motorola 96002, use the IEEE floating-point format.

Local  data  is  data  used  exclusively  by  either  the  host  processor  or  by  the  DSP.  Therefore, 
local  data  use  should  be  maximized  to  avoid  the  need  for  data  transfer.  DSP’s  typically  contain 
limited  on-chip  memory,  with  the  bulk  of  the  memory  off-chip.  The  on-chip  memory  is  divided  into 
RAM  and  ROM  sections.  The  RAM  can  be  used  either  for  program  or  data  storage.  The  ROM 
section  may  contain  DSP  subroutines,  trigonometric  constants,  or  it  may  contain  customized  DSP 
programs.  The  code  for  computation-intensive  statistical  algorithms  often  is  relatively  small  and  may 
be  placed  in  the  on-chip  memory  bank.  However,  the  data  files  are  usually  large  and  require  off-chip 
storage. 


Figure  DSP  architecture. 


A key design consideration involves memory sizing and speed selection. A DSP is very
flexible in handling different memory types and the slower memories can be interfaced by introducing
wait states. Memory is perhaps the most difficult tradeoff because it depends on the scope of the
statistician’s problem. For any one user, we must determine the access speed of the memory and the
storage capacity required. Since high speed memory is expensive, a proper balance between cost and
speed can be achieved using slower speed memory for bulk data storage and higher speed memory
for computationally intensive parts of the program. This approach is straightforward because the DSP
can handle two different memory banks with different access times. Thus, the use of high speed static
memory (SRAM) for program and dynamic memory (DRAM) for data storage may be suitable for
the statistics workstation design.

5.2.4  Commercial  DSP  Boards

Since  the  available  DSP  boards  differ  in  their  architectures,  their  manufacturers  supply 
application  software  for  use  with  a  specific  board.  This  software  typically  includes  software  modules 
for  program  and  data  downloading  to  the  DSP  board  and  data  uploading  to  the  host  processor. 
Appendix  D  gives  descriptions  of  some  of  the  commercially  available  DSP  boards.  In  this  study  a 
DSP  board  manufactured  by  CAC,  Inc.  was  selected. 

5.2.5  Graphics  Processor 

Just as the DSP can provide a computation speed advantage, the use of a specialized graphics
coprocessor could provide a substantial improvement in display capability. The use of a high-speed
graphics coprocessor, however, would be cost-effective only in those situations where continuous
real-time display capability is needed. Further improvement could be obtained by directly interfacing
the DSP with the graphics coprocessor. For most of the other graphics display needs, the conventional
graphics support (PC-based) would be sufficient. Due to time constraints, only the PC-based
approach was investigated during the Phase I effort.

5.3  Statistics  Workstation  Functional  Design  and  Operation 

Next  to  the  hardware  architecture,  the  functional  design  of  the  statistics  workstation  will  have 
a  major  impact  on  performance.  Particularly  important  will  be  function  assignment  to  the  different 
processors residing in the system. In addition, to achieve optimum performance, the workstation
system control program must be fast and simple, thus reducing overhead.

5.3.1  Function  Assignment  to  the  Host  PC  and  DSP

The  optimum  partitioning  of  computation  tasks  between  the  host  processor  and  the  DSP  is 
critical  to  achieving  the  best  performance.  This  task  assignment,  however,  is  complicated  because  the 
DSP  can  operate  in  parallel  with  the  host  processor.  In  addition,  the  DSP  is  inherently  a  parallel 
device.  Thus  the  proper  balance  can  be  achieved  only  by  a  careful  consideration  of  all  aspects  of  this 
problem,  including  data  transfer  and  pipelining. 

5.3.2  Host  PC  Functions

The highest-level operations are controlled by the host PC CPU. The parallel configuration
allows the system to do multitasking with no performance degradation if a careful partitioning of tasks
is chosen. For example, the host processor could work on preprocessing a data set, while the DSP
is performing a lengthy iteration (see Figure 5.4). Many other similar implementations are possible,
such as multiple DSP’s to reduce computation time.

Tasks assigned to the host processor include the management of all data and the control of the
display. The statistics workstation master control includes controlling the host and DSP programs,
data downloading to the DSP, initiating DSP operations, as well as retrieving computation results
from the DSP.


The  serial  data  transfer  to 
the  DSP  can  be  a  lengthy  process. 
If  the  data  transfer  is  not  optimized, 
then  the  computation  process  can 
easily  become  input/output  (I/O) 
bound. 


5.3.3  DSP  Functions

The  DSP  performs  the  bulk 
of  floating  point  computations.  In  addition,  the  DSP  performs  floating  point  conversions  (IEEE  to 
internal  and  internal  to  IEEE  format). 

The  DSP  normally  operates  in  a  slave  mode  to  the  host  processor.  Since  the  DSP  is  driven 
by  the  resident  program,  it  will  have  a  relatively  high  level  of  autonomy,  including  local  control,  thus 
reducing  the  host  control  task  complexity. 

The  DSP  has  status  registers  to  indicate  error  conditions.  These  can  be  monitored  and  the 
recovery  from  error  conditions  could  be  performed  either  locally  or  delegated  to  the  host  processor. 
In  this  way,  the  host  acts  as  a  software/hardware  monitor. 

Thus, the DSP operation differs considerably from that of a conventional coprocessor (cf.
Figure 9.1). Whereas the DSP executes an internal program, the coprocessor only executes those
instructions  which  are  identified  for  the  coprocessor.  Furthermore,  before  any  arithmetic  instruction 
can  be  executed  in  the  coprocessor,  the  data  must  be  downloaded.  As  a  result,  there  is  a  much 
heavier  I/O  data  transfer  between  the  coprocessor  and  the  host  processor. 

Multiprocessor  DSP  Extensions.  Most  commercial  DSP’s  are  suitable  for  use  in  a 
multiprocessor  environment.  In  a  statistics  workstation,  the  extra  processors  could  handle  tasks  such 
as  random  number  generation  or  the  computation  of  some  complex  functions. 

5.3.4  Shared  Functions

Figure 5.4 Flowchart for DSP operation with concurrency.

There are some tasks that could be divided between the host and the DSP. For example, in
handling graphics displays, the DSP can perform floating point to integer conversion much faster than
the host processor. Thus, graphics data could be preprocessed by the DSP before they are uploaded
to the host processor.
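
The graphics preprocessing step can be illustrated with a short sketch. It is written in portable C rather than DSP code, and the function name `scale_to_pixels`, the clamping behavior, and the top-down row convention are assumptions made for this example only.

```c
#include <stddef.h>

/* Map values in [lo, hi] to integer pixel rows 0..height-1, with row 0
   at the top of the display. Values outside the range are clamped. */
void scale_to_pixels(const float *y, int *row, size_t n,
                     float lo, float hi, int height)
{
    float scale = (hi > lo) ? (float)(height - 1) / (hi - lo) : 0.0f;
    for (size_t i = 0; i < n; i++) {
        float v = y[i];
        if (v < lo) v = lo;
        if (v > hi) v = hi;
        /* Adding 0.5 before truncation rounds to the nearest pixel. */
        int p = (int)((v - lo) * scale + 0.5f);
        row[i] = (height - 1) - p;
    }
}
```

Only the integer array `row` would then need to be uploaded to the host, so the host display code never touches floating point data.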

5.3.5  Data  Transfer

Data transfer at the system level can be optimized using direct memory access (DMA)
block transfers. Provisions for data buffering must be made for the transfer of larger blocks.
All of these transfers are accomplished in the programmed block transfer mode (see Figure
5.5).
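
A buffered block transfer can be sketched as a loop over fixed-size blocks. In this illustration `memcpy` merely stands in for one programmed block transfer over the host bus; the function names and the block length parameter are assumptions, and the time estimate simply divides the byte count by an assumed bus rate such as the approximate 1 MB/s figure quoted in Section 5.2.3.

```c
#include <stddef.h>
#include <string.h>

/* Copy n floats from src to dst in blocks of at most block_len floats.
   Each memcpy stands in for one block transfer over the host bus.
   Returns the number of block transfers issued. */
size_t block_transfer(float *dst, const float *src, size_t n, size_t block_len)
{
    size_t blocks = 0;
    if (block_len == 0) return 0;
    for (size_t off = 0; off < n; off += block_len) {
        size_t len = (n - off < block_len) ? (n - off) : block_len;
        memcpy(dst + off, src + off, len * sizeof(float));
        blocks++;
    }
    return blocks;
}

/* Rough transfer time in seconds for n_floats single precision values
   at a given bus rate in bytes per second. */
double transfer_seconds(size_t n_floats, double bytes_per_second)
{
    return (double)(n_floats * sizeof(float)) / bytes_per_second;
}
```

With 4-byte floats, `transfer_seconds(250000, 1.0e6)` gives about one second, which shows why repeated transfers of large data sets can make a computation I/O bound.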


To  achieve  the  highest  efficiency,  the 
I/O  transfer  between  the  host  and  the  DSP  is 
minimized.  This  means  that  the  data  reside  in 
the  DSP  as  long  as  possible  and  that  the 
computation  tasks  are  partitioned  in  such  a  way  as  to  reduce  the  data  transfer.  Minimum  data 
transfer  will  also  affect  the  data  management  strategy  and  will  mean  that  sufficient  DSP  memory 
space  be  made  available.  As  memory  prices  continue  to  decline,  the  direct  use  of  DSP  memory  for 
data  storage  will  become  more  attractive’. 

5.4  Extensions  for  Statistical  Applications 

Another promising approach is to multitask existing systems. Eddy [1986b,d] describes a
multiprocessor VAX system for statistical calculations. The improvement in this area depends on the
number of processors and is limited by the needed overhead. In addition, for this type of setup only
a limited number of users have access to the system of interconnected processors.

In future systems we can expect that multi-tasking will be supported internally to the
statistics workstation as well as in the external environment, as shown in Figure 5.6.


Figure  5.6  Future  DSP-based  statistics 
workstation  expansion. 


Figure 5.5 Parallel bus transfer between host and DSP.


’ Presently available commercial DSP boards feature up to 8 MB of memory for less than $10K.

6  DSP  SOFTWARE  DEVELOPMENT  NEEDS


This  section  outlines  the  software  development  objectives,  discusses  how  the  statistics 
workstation  software  will  be  structured,  and  what  capabilities  will  be  needed  at  the  various  levels  in 
this  structure. 

6.1  DSP  Software  Development  Objectives 

The  overall  objective  of  the  statistics  workstation  development  is  to  provide  a  high-speed 
solution  to  a  wide  range  of  computation-intensive  statistical  problems  with  sufficient  accuracy  and 
acceptable presentation of results. In particular, if accuracy is not provided by the hardware, different
software algorithms can be substituted.

The key software development objective is to reduce program complexity. Although it would
be highly desirable to develop automatic programming support for the DSP to reduce programming
effort and improve reliability, the complexity of the DSP architecture does not permit an easy solution
of this problem. This complexity is in some respects similar to that of supercomputers and vector
processors. After years of effort and considerable experience with supercomputers and vector
processors, there is still a need to perform a low-level (assembly language) manual optimization to
achieve the desired speed improvement. As recent research has shown, considerable improvement
can be achieved only by extending the level of this optimization, as in the case of BLAS, BLAS-2,
and BLAS-3 [Harrod 1987]. The availability of a comprehensive set of library modules also helps to
reduce program development costs and improves program transportability to a different DSP.

The  initial  development  objectives  are  to  select  the  assembly  language  interface,  high-level 
language,  macro  utility,  and  to  identify  other  software  development  aids. 

6.2  Programming  Tool  Selection 

The  programming  tool  selection  includes  selection  of  programming  languages  and  any  other 
aids  that  can  be  used  to  assist  in  the  software  development  process. 

The  approach  taken  in  this  feasibility  investigation  for  statistical  software  development 
involves  using  the  C  compiler  with  predefined  optimized  subroutines  which  are  contained  in  the  DSP 
library.  The  presently  available  DSP  libraries  contain  good  collections  of  general  purpose  signal 
processing routines which can be interfaced with the C language. These routines also provide the
basic  functions  such  as  division,  power,  square  root,  trigonometric,  and  exponential  functions. 
However,  they  seldom  contain  any  of  the  more  specialized  statistical  routines  described  in  section  4. 
These  low-level  routines  must  be  hand-coded  and  included  separately. 

Although not optimal from a computation perspective, coding the rest of the statistical
functions in a high-level language is a much faster and less error-prone process than hand coding.
The C compilers can do program initialization, startup routines, and I/O operations. Structured C
programs are easier to maintain and debug if optimization is not important. For example, a stack for
subroutine calls will automatically do all of the register bookkeeping.

When speed is critical, and function call overhead is intolerable, in-line assembler code within
high-level language programs can be used. This permits the programmer to write the critical portions
of the code directly in assembly language.

6.2.1  Assembly  Language 

Assembler coding is the most common approach to DSP programming and is widely used in
those situations where it is important to minimize program size and optimize speed, such as in the
low-level routines. Properly done, assembler coding can result in highly efficient code (cf. Appendix
A).

Unfortunately,  in  larger  programs,  assembler  coding  is  a  very  slow  and  difficult  process  and 
is  subject  to  errors.  Assembler  coded  programs  are  also  more  difficult  to  debug  and  maintain. 
Further,  since  DSP  instructions  are  unique  to  a  specific  manufacturer  and  subject  to  major  changes 
with  each  new  major  release,  it  may  be  difficult  to  maintain  sufficient  experience  in  DSP 
programming. 

The  differences  in  DSP  architectures  prevent  direct  transfer  of  assembler-coded  statistical 
algorithms  from  one  DSP  to  another.  This  situation  is  similar  to  that  faced  by  the  supercomputer 
programmers.  They  are  also  required  to  develop  or  modify  the  lower  level  modules  for  each  change 
in  computer  architecture.  Therefore,  the  device  dependence  has  a  major  impact  on  future 
development  efforts.  No  simple  solution  exists  for  this  problem,  because  a  standard  architecture 
would  have  a  negative  effect  on  future  system  development.  However,  the  availability  of  well-defined 
library  standards  can  help  in  the  updating  phase. 

The  selection  of  the  assembly  language  is  determined  by  the  selected  hardware.  Usually  the 
only  source  available  is  the  device  manufacturer.  Most  of  the  DSP  assemblers  have  the  capability  to 
interface  to  a  higher  level  language  compiler  such  as  C.  For  more  detail  on  the  assembly  language 
format,  see  Appendix  F. 

6.2.2  High-level  Programming  Language 

A  high  level  programming  language  is  needed  to  reduce  the  statistical  software  development 
effort.  Unfortunately,  the  conventional  programming  languages,  such  as  FORTRAN,  Pascal,  C, 
Modula-2, and Ada, have been developed to support standard processors and seldom have the
capability  needed  to  exploit  the  special  features  that  are  available  in  DSP’s.  As  a  result,  these 
languages  are  not  particularly  well  suited  for  DSP  programming  if  optimized  results  are  desired. 
However,  they  can  provide  a  very  cost-effective  solution  for  those  parts  of  the  program  which  are  not 
particularly  computation-intensive  or  which  can  call  on  the  optimized  routines. 

The  only  widely  available  high-level  compilers  for  DSP’s  are  C  compilers.  A  C  compiler 
provides  fast,  but  not  always  optimal  code  for  the  DSP.  Because  of  DSP  programming  constraints 
on  pipelining,  specific  instruction  sequence,  and  operation  execution  sequence,  the  C  language 
compilers  are  not  capable  of  performing  optimization  at  the  lowest  level.  They  do  not  have  the 
ability to modify the algorithms to a different, yet equivalent form and cannot look ahead to
conditional  branching  effects. 

Since a C compiler is device dependent, a different compiler is needed for each device.
Furthermore, since the compilers are expensive (in comparison to PC language compilers), supplying
a compiler for each DSP can be prohibitive. Performance benchmarks which use the DSP32 C
compiler/assembler are given in Section 8.

6.2.3  Macro  Processors

A macro processor provides an alternate approach to assembly-level programming and a tool
for high-level languages. A benefit of a macro processor is that it provides a fast way of generating
relatively error-free code in a well structured environment. It also permits the use of highly
optimized code segments which can be readily adapted in different parts of the program. The macros
can be made device independent by changing the template rules. This allows the generation of code
for different hardware configurations. However, the majority of the present DSP assemblers,
excepting the Motorola 96K assembler, do not yet provide full macro capabilities. Limited macro
definition capability is available in the AT&T DSP32 using the #define construct that is evaluated
by a preprocessor.
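
The kind of code reuse a `#define` construct offers can be illustrated with the standard C preprocessor. The sketch below is not DSP32 assembler syntax; the macro names `MAC_STEP` and `MAC_STEP4` and the function `dot_unrolled` are hypothetical and serve only to show how one optimized code segment can be expanded at several places in a program.

```c
#include <stddef.h>

/* One multiply-accumulate step; both pointers advance as a side effect. */
#define MAC_STEP(acc, px, py)  ((acc) += *(px)++ * *(py)++)

/* Expand the step four times to unroll the inner loop. */
#define MAC_STEP4(acc, px, py)                                   \
    do {                                                         \
        MAC_STEP(acc, px, py); MAC_STEP(acc, px, py);            \
        MAC_STEP(acc, px, py); MAC_STEP(acc, px, py);            \
    } while (0)

/* Dot product built from the macros: unrolled main loop plus remainder. */
float dot_unrolled(const float *x, const float *y, size_t n)
{
    float acc = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) MAC_STEP4(acc, x, y);
    for (; i < n; i++) MAC_STEP(acc, x, y);
    return acc;
}
```

Changing the body of `MAC_STEP` changes every expansion at once, which is the property that makes template-driven macros attractive when retargeting code to a different device.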

In  this  study,  the  STAGE2  macro  generator  [Waite  1973]  was  selected  for  template  matching 
and  low  level  programming.  However,  we  found  that  for  efficient  use,  it  requires  the  development 
of an optimized set of specific macros.

6.2.4  Other  DSP  Compilers 

It  may  be  possible  to  develop  a  compiler  that  is  more  DSP-oriented  than  the  general  purpose 
C  language.  One  approach  uses  a  compiler-compiler  generator,  such  as  YACC  from  the  UNIX 
system.  If  both  the  standard  lexical  scanner  and  the  code  generator  are  used  then  it  is  possible  to 
update  the  compiler  accurately  and  efficiently  in  case  future  changes  are  required. 

Another  approach  would  involve  the  development  of  a  higher-order  language,  specifically 
tailored  for  statistical  problem  definition.  One  potential  starting  point  is  to  use  an  existing  hardware 
description  language,  such  as  VHDL  [IEEE  19^],  and  then  modify  it  to  include  statistical  concepts. 
A different approach could use the S language as the starting point.

The  use  of  an  APL-like  language  in  a  DSP  environment  could  also  be  investigated.  To  our 
knowledge,  this  approach  has  not  yet  been  investigated.  APL  problem  formulation  is  very  good  for 
vector and matrix math, but the terse language often makes the programs difficult to read or
maintain. There has been some emphasis on including vector operations in the new FORTRAN and
C standards (i.e. FORTRAN 88 and Numerical C*).

6.3  Software  Development  Guidelines 

In the remainder of section 6, we will examine the individual steps in the software
development process using both assembly and higher level languages. The individual steps in DSP
program development are shown in Figure 6.1 and listed below.

o  Create optimized low-level routines

o  Write high-level control in C

o  Compile and link low-level with high-level

o  Compare speed and debug with respect to the host-compiled C code

* ANSI C standard committee X3J11.1

The  reduction  of  problem  complexity 
can  be  best  achieved  by  partitioning  the 
problem  into  subproblems  for  which  it  is  easier 
to  identify  the  solution  techniques.  In  the 
statistics  workstation,  the  problem  partitioning 
involves  task  assignments  to  the  host 
microprocessor  and  the  slave  DSP.  This 
partitioning  should  be  designed  to  use  the  best 
capabilities  of  both  devices. 


[Figure 6.1 diagram. Low-level (inner loops): file2.c -> file2.i -> optimize by hand -> file2.o ->
place in library. High-level (iteration control, outer loops, etc.): file1.c. Compile and link -> a.out]

Figure 6.1 DSP software development using the C language.


To achieve the desired high efficiency, a careful examination of the mapping of the
computation algorithms to the statistics workstation and DSP architecture must be undertaken. This
step will be most effective only at the lowest levels of the program structure where it will affect the
basic building blocks. Once these blocks have been identified, optimized, and incorporated in a
library, they will be ready to be used with a conventional programming language, such as C.


Thus,  although  higher  level  compilers  can  do  much  to  reduce  the  programming  effort, 
improved  efficiency  can  only  be  achieved  by  manual  optimization  at  the  BLAS  and  BSAS  level. 
Fortunately,  there  are  only  a  limited  number  of  frequently  used  lower  level  modules  which  need  to 
be optimized. These modules can be easily identified, and support libraries developed. This
approach  was  followed  during  the  statistics  workstation  feasibility  investigation. 


6.3.1  Algorithm  Hierarchy


The  algorithm  design  followed  a  three  level  hierarchy  (see  Section  4  and  Figure  4.1).  At  the 
lowest  level  the  building  blocks  were  identified.  At  the  intermediate  level,  basic  statistical  algorithms 
were  identified.  At  the  top  level,  the  outer  loops  for  the  computation-intensive  statistics  were 
investigated.  The  initial  question  to  answer  at  each  of  these  levels  was  whether  to  keep  the  level  in 
host  or  DSP  and  whether  to  code  in  DSP  assembly  language  or  C.  In  addition,  we  observed  that  the 
hierarchy of a third level application such as bootstrapping makes it an ideal starting point for
multiple  DSP  operation.  Similarly,  dedicated  DSP’s  for  random  number  generation  and  graphics 
support  could  provide  performance  improvements  at  the  lower  levels. 

First  Level  -  BLAS  with  BSAS  Extensions.  This  is  the  lowest  level  in  the  software  hierarchy. 
It  consists  of  all  of  the  key  linear  algebra  algorithms  (BLAS)  with  extensions  for  handling  statistical 
procedures  such  as  mean  and  variance  (BSAS),  as  described  in  Section  4.1.  Since  there  are  many 
such  statistical  subroutines,  only  those  that  were  expected  to  be  used  widely  and  affect  the  evaluation 
were  selected  for  detailed  investigation. 

Since BLAS and BSAS are at the lowest level in the algorithm hierarchy, this is also the most
heavily optimized level, to take advantage of the DSP’s computing speed. Thus, all of these modules
were programmed in DSP assembly language to achieve the highest speed improvement. Details on the
subroutines are further described in Appendix A.
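
To make the first level concrete, the following sketch shows one BLAS-style and two BSAS-style building blocks in plain C. These are not the assembly-coded routines of Appendix A: the names `vec_axpy`, `vec_mean`, and `vec_variance`, the two-pass variance, and the n-1 divisor are assumptions chosen for illustration.

```c
#include <stddef.h>

/* BLAS-style building block: y <- a*x + y. */
void vec_axpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++) y[i] += a * x[i];
}

/* BSAS-style building block: mean of x (n must be at least 1). */
float vec_mean(size_t n, const float *x)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) sum += x[i];
    return sum / (float)n;
}

/* BSAS-style building block: sample variance with divisor n-1
   (n must be at least 2). Two passes: mean first, then deviations. */
float vec_variance(size_t n, const float *x)
{
    float m = vec_mean(n, x);
    float ss = 0.0f;
    for (size_t i = 0; i < n; i++) {
        float d = x[i] - m;
        ss += d * d;
    }
    return ss / (float)(n - 1);
}
```

Each routine is a single pass of multiply and add operations over contiguous data, which is the loop shape that benefits most from hand optimization on a DSP.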

Second  Level  -  Statistical  Functions.  This  level  uses  the  basic  building  blocks  provided  by  the 
BSAS and BLAS subroutines to create more complex statistical operations such as regression,
correlation,  or  singular  value  decomposition.  This  level  also  represents  capabilities  similar  to  those 
available  in  the  LINPACK  routines.  To  maintain  a  speed  advantage,  these  routines  must  be  kept  in 
the  DSP.  The  tradeoff  of  a  slightly  lower  speed  versus  a  much  reduced  code  complexity  allowed  the 
majority  of  these  subroutines  to  be  programmed  in  C. 
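
A second level routine written in C on top of lower level helpers might look like the sketch below, which fits a simple least-squares line. The helper functions are illustrative stand-ins for library building blocks, and all names and the error convention are assumptions rather than the routines used in the study.

```c
#include <stddef.h>

/* Level-one stand-in: mean of x (n must be at least 1). */
static float mean_f(size_t n, const float *x)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) sum += x[i];
    return sum / (float)n;
}

/* Level-one stand-in: sum of products of deviations from the means. */
static float sum_cross_dev(size_t n, const float *x, float mx,
                           const float *y, float my)
{
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += (x[i] - mx) * (y[i] - my);
    return s;
}

/* Level-two routine: least-squares line y = intercept + slope * x.
   Returns 0 on success, -1 if n < 2 or x has no spread. */
int simple_regression(size_t n, const float *x, const float *y,
                      float *slope, float *intercept)
{
    if (n < 2) return -1;
    float mx = mean_f(n, x);
    float my = mean_f(n, y);
    float sxx = sum_cross_dev(n, x, mx, x, mx);
    if (sxx <= 0.0f) return -1;
    *slope = sum_cross_dev(n, x, mx, y, my) / sxx;
    *intercept = my - *slope * mx;
    return 0;
}
```

The level-two code contains only control flow and a few scalar operations, so compiling it from C costs little speed as long as the helper loops are the optimized ones.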

Third Level - Computation-intensive Application Programs. The third or applications level
represents  actual  computation-intensive  statistical  operations,  such  as  bootstrapping  on  a  correlation 
or  regression  analysis.  As  at  the  second  level,  it  was  initially  questioned  whether  this  level  should  be 
programmed  in  DSP  assembly  language  or  in  a  higher-level  language,  such  as  C.  Since  optimization 
at  this  level  has  less  effect  on  the  program  performance  than  optimization  at  a  lower  level,  the  use 
of  C  at  this  level  reduces  both  programming  effort  and  debugging. 

Note that in the development of the workstation we are using two different C compilers.
One of these is used for compiling the host program, whereas the other is used for compiling the
DSP program (see Figure 6.2). The two compilers must use compatible data structures. A clearly
defined interface between the two compiled programs makes this high-level language support
possible. This interface specifies the needed data structures and the data transfer protocol (see
Section 7).
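
One way to keep two compilers in agreement is to describe each transferred block with a header made only of fixed-width fields. The sketch below is hypothetical: the structure `block_header`, its fields, and the format codes are assumptions for illustration and are not the protocol of Section 7.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical codes for the floating point format of the payload. */
enum { FMT_IEEE_SINGLE = 0, FMT_DSP_NATIVE = 1 };

/* Hypothetical block header shared by the host and DSP programs.
   Only 32-bit fields are used so that both compilers lay the
   structure out identically, with no padding between members. */
struct block_header {
    uint32_t n_rows;   /* number of matrix rows in the block */
    uint32_t n_cols;   /* number of matrix columns in the block */
    uint32_t format;   /* FMT_IEEE_SINGLE or FMT_DSP_NATIVE */
};

/* Total number of 32-bit words in a transfer: header plus data. */
uint32_t block_words(const struct block_header *h)
{
    return (uint32_t)(sizeof *h / sizeof(uint32_t)) + h->n_rows * h->n_cols;
}
```

If both programs include the same header definition, a mismatch in data layout becomes a compile-time concern rather than a source of silent errors during transfer.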


6.3.2  Problem  Oriented  Language

In  addition  to  the  use  of 
higher  level  languages,  another  objective  of  the  statistics  workstation  design  was  to  investigate  the 
feasibility  of  providing  a  problem  oriented  language  interface,  similar  to  that  currently  available  in 
the  S  language. 

Figure 6.2 Dual software development for host and DSP.

The S language has undergone considerable changes since its introduction. The most recent
version [Becker 1988] is more C language oriented and as such is more suitable to serve as a basis
of comparison for the statistics workstation interface development. One approach to providing this
problem oriented interface is based on using precompiled modules in conjunction with a user
command interpreter. Although a complete command interpreter would require considerable
programming effort, the initial investigation confirmed the feasibility of this approach. Efron [1986]
noted that updating the S language for computation-intensive applications is not that difficult. For
example, bootstrapping a correlation coefficient reduces to

    tboot(data, correlation, B=1000)

in the S language, where B is the number of bootstraps.
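
The work hidden behind such a one-line call can be sketched in C. This is not the implementation of `tboot`: the function `bootstrap_pairs`, the fixed buffer size, and the small linear congruential generator are assumptions for illustration. The statistic is passed as a function pointer, mirroring how `correlation` is passed above; the squared correlation is used here only so that the sketch needs no square-root routine, and any function with the same signature could be substituted.

```c
#include <stddef.h>
#include <stdint.h>

#define BOOT_MAX_N 256   /* largest sample size this sketch accepts */

typedef float (*statistic_fn)(size_t n, const float *x, const float *y);

/* Small linear congruential generator, for illustration only. */
static uint32_t lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;   /* wraps modulo 2^32 */
    return *state;
}

/* Example statistic: squared correlation of paired data (0 if degenerate). */
float squared_correlation(size_t n, const float *x, const float *y)
{
    float mx = 0.0f, my = 0.0f, sxx = 0.0f, syy = 0.0f, sxy = 0.0f;
    for (size_t i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
    mx /= (float)n;
    my /= (float)n;
    for (size_t i = 0; i < n; i++) {
        float dx = x[i] - mx, dy = y[i] - my;
        sxx += dx * dx; syy += dy * dy; sxy += dx * dy;
    }
    if (sxx <= 0.0f || syy <= 0.0f) return 0.0f;
    return (sxy * sxy) / (sxx * syy);
}

/* Draw B resamples of n pairs with replacement and store the statistic
   of each resample in out[0..B-1]. Returns the number computed. */
size_t bootstrap_pairs(size_t n, const float *x, const float *y,
                       statistic_fn stat, size_t B, uint32_t seed, float *out)
{
    float xs[BOOT_MAX_N], ys[BOOT_MAX_N];
    uint32_t state = seed;
    if (n == 0 || n > BOOT_MAX_N) return 0;
    for (size_t b = 0; b < B; b++) {
        for (size_t i = 0; i < n; i++) {
            size_t k = (size_t)((lcg_next(&state) >> 16) % n);
            xs[i] = x[k];
            ys[i] = y[k];
        }
        out[b] = stat(n, xs, ys);
    }
    return B;
}
```

The outer loop over B contains no floating point work of its own, which is consistent with coding the application level in C while the statistic itself relies on optimized lower level routines.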

Beyond  this  level,  software  support  is  needed  to  simplify  interfacing  the  statistical  analysis 
programs  to  other  user  application  programs.  For  example,  a  link  to  a  higher-level  language  or  to 
a spreadsheet, wordprocessing, or graphics program may be desirable. Providing a spreadsheet-based
input  is  a  particularly  important  feature  of  the  statistics  workstation  because  it  permits  the  use  of 
statistical  computation  results  as  part  of  more  complex  models. 

6.3.3  Data  Structures

The  selection  of  data  structures  for  use  in  the  statistics  workstation  design  is  an  important 
decision  because  the  data  structure  not  only  has  a  major  influence  on  performance  during  data 
transfer,  but  also  during  the  actual  computations. 

Data  transfer  to  the  DSP  is  handled  by  the  host  processor.  Data  conversion  to  the  needed 
format  is  best  handled  by  the  DSP  because  of  the  higher  speed.  Since  the  DSP  is  capable  of  a  single 
instruction  float-to-integer  conversion,  this  capability  should  be  used  whenever  integer  data  is  needed 
in  the  main  program. 

Data  storage  schemes.  A  uniform  data  storage  scheme  will  not  only  speed  up  computations, 
but  will  also  help  during  debugging.  For  example,  data  storage  is  particularly  important  when  fast 
matrix solution algorithms are used. The convention in this case is to store the matrix data in
row-major order. For other data structures, the program data structure must be carefully examined to determine
the  optimum  partitioning  scheme. 
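
Row-major storage can be made explicit with a small indexing macro. The sketch below is an illustration in C, with the names `ROW_MAJOR` and `mat_vec` chosen for this example; it shows why the convention suits inner loops that step through memory one element at a time.

```c
#include <stddef.h>

/* Element (i, j) of a matrix with n_cols columns stored by rows. */
#define ROW_MAJOR(a, n_cols, i, j)  ((a)[(size_t)(i) * (n_cols) + (size_t)(j)])

/* y = A * x for an n_rows by n_cols matrix A stored by rows. Each row
   is contiguous, so the inner loop walks memory with stride one. */
void mat_vec(size_t n_rows, size_t n_cols,
             const float *a, const float *x, float *y)
{
    for (size_t i = 0; i < n_rows; i++) {
        float acc = 0.0f;
        for (size_t j = 0; j < n_cols; j++)
            acc += ROW_MAJOR(a, n_cols, i, j) * x[j];
        y[i] = acc;
    }
}
```

A stride-one inner loop is what allows address autoincrementing to be used, so agreeing on one storage order across all library routines avoids repacking data between calls.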

All  of  the  prototyped  BLAS  and  BSAS-based  algorithms  used  a  common  storage  approach 
(row  major).  Since  the  majority  of  these  subroutines  are  used  in  conjunction  with  the  C  language, 
register assignment and usage needed by the C program is strictly observed.

Global  data.  Global  data  handling  represents  some  unique  problems.  Normally  all  of  the 
needed  global  data  should  be  downloaded  to  the  DSP  to  reduce  the  need  for  continuous  data  access 
to  the  host  memory.  In  future  implementations,  a  common  dual  access  memory  may  provide 
improvement.  This  will,  however,  require  that  a  standard  format  (IEEE  standard)  for  float  variables 
be  used. 

Memory management. In the feasibility study, memory allocation is provided in the DSP at
the compile stage (static memory). There are also several undocumented, but available, functions in
the AT&T C compiler suitable for dynamic memory allocation. This is critical for applications where
high speed memory is at a premium. Several other general memory management schemes may also
be employed. In one scheme, the host would be responsible for the memory management. A
combination of DSP and PC memory management is also possible.

6.3.4  DSP  Programming  Approach

Although  many  of  the  statistical  problems  are  easy  to  set  up  and  to  solve,  some  of  the  more 
recent  statistical  methodologies  are  not  only  complex,  but  also  require  substantial  setup  time.  These 
problems  can  seldom  be  expressed  in  a  simple  sequence  of  steps.  Thus  their  solution  demands 
considerable  flexibility  in  the  statistics  workstation  design.  As  the  problem  complexity  can  be  reduced 
by  modular  structuring,  identification  of  the  basic  building  blocks  should  be  made  whenever  possible. 

The  emphasis  in  the  statistics  workstation  software  development  effort  is  to  achieve  the  best 
possible  speed  improvement.  This  required  using  the  unique  capabilities  of  the  DSP  to  the  maximum 
extent  possible.  Particularly  important  was  the  use  of  compound  operations,  autoincrementing  of 
addresses,  and  fully  loading  the  parallel  structure. 

We  found  that  many  of  the 
programming  techniques  developed 
for  numerical  coprocessors  were  not 
directly  applicable  in  the  DSP 
environment  for  a  number  of 
reasons,  as  explained  below.  First, 
whereas  numeric  coprocessors 
operate  in  line  with  the  main 
microprocessor  and  share  a  common 
instruction  structure,  the  DSP  is  an 
autonomous  device  with  its  own 
program  storage  and  instruction  set. 

Second,  when  using  the  DSP,  both  the  program  and  the  data  must  be  downloaded  and  the  results
retrieved  (see  Figure  6.3).  The  numeric  coprocessor,  on  the  other  hand,  also  requires  a  data  load  but
accepts  only  a  single  instruction.  The  conventional  numeric  coprocessors  use  fixed  microprograms
in  a  stack  mode  and  do  not  support  internal  programming.  Third,  the  numeric  coprocessors  have  very
limited  data  storage  capacity.  Therefore,  data  must  be  downloaded  every  time  it  is  needed.  The  DSP,
on  the  other  hand,  has  more  capacity  for  data  storage,  requires  less  data  transfer,  and  operates
independently.

Thus,  the  program  development  for  DSP  applications  had  to  follow  a  different  set  of  rules:
direct  translation  of  programs  developed  for  use  with  numeric  coprocessors  could  seldom  achieve
the  potential  speed  improvement  possible  with  the  DSPs.  It  also  meant  that  some  new  and  unique
algorithms  had  to  be  developed  and  the  developed  code  optimized  for  speed.

Tasking  of  Statistical  Procedures.  Every  statistical  procedure  involves  three  distinct  phases: 
setup,  operation,  and  transfer  of  results.  When  computation-intensive  statistical  procedures  are 
selected,  most  of  the  processing  time  is  spent  in  the  second  phase. 

For  each  set  of  statistical  computations,  we  can  distinguish  those  basic  operations  that  belong
to  the  host,  those  that  belong  to  the  DSP,  and  those  that  use  the  capabilities  of  both  processors.  The  DSP  is  most
efficient  when  floating  point  computations  are  performed  in  parallel  with  indexing  in  the  DSP.  A 
further  objective  is  to  balance  the  operations  between  the  host  processor  and  the  DSP  (see  Figure 
6.4).  In  particular,  it  is  important  to  keep  the  PC  busy  while  waiting  for  the  DSP  to  complete  its 
computations. 
Figure  6.4  Task  assignment  for  host  and  DSP.
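
The  goal  of  keeping  the  host  busy  while  the  DSP  computes  can  be  sketched  in  host-side  C.  This  is
a  minimal  simulation  under  stated  assumptions,  not  the  workstation  code:  the  DSP  is  modelled  by  a
counter  (the  hypothetical  struct  sim_dsp)  that  reports  completion  after  a  fixed  number  of  polls,  and
count_task  stands  for  any  useful  host  work  such  as  intermediate  plotting.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the DSP: "busy" for a fixed number of polls. */
struct sim_dsp {
    int polls_left;
};

/* Returns 1 while the simulated DSP is still computing, 0 once it is done. */
static int sim_dsp_busy(struct sim_dsp *dsp)
{
    assert(dsp != NULL);
    if (dsp->polls_left > 0) {
        dsp->polls_left--;
        return 1;
    }
    return 0;
}

/* Example host task: counts how often it was run. */
static void count_task(void *arg)
{
    ++*(int *)arg;
}

/* Run host_task once per poll while the DSP is busy.
   Returns the number of host tasks completed in the meantime. */
static int overlap_host_work(struct sim_dsp *dsp,
                             void (*host_task)(void *), void *arg)
{
    int done = 0;
    assert(host_task != NULL);
    while (sim_dsp_busy(dsp)) {
        host_task(arg);
        done++;
    }
    return done;
}
```

In  a  real  system  the  poll  would  read  a  completion  flag  in  DSP  memory,  and  the  host  task  should  be
short  enough  that  the  finished  DSP  is  not  left  idle  for  long.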


The  important  issues  in  tasking  are  computation  task  duration  and  the  specific  task 
assignments.  Some  of  the  computations  can  be  delayed  due  to  the  DSP  pipeline  effects.  For  these 
instructions  we  have  to  consider  the  number  of  cycles  needed  before  the  result  is  available,  and  the 
number  of  wait  cycles  (due  to  memory  conflicts).  The  computation  task  assignment  must  further 
consider  DSP  characteristics,  such  as  concurrent  index  updating  and  multiply-accumulate  instructions. 

Statistical  Algorithm  Optimization.  Statistical  algorithm  development  is  a  two-step  process. 
First,  the  low  level  loops  are  examined  and  optimized  subroutines  introduced.  Second,  the  selected 
algorithms  are  modified  to  favor  MAC  operations.  To  achieve  the  best  speedup,  it  is  important  to 
identify  the  inner  loops  that  are  repeated  many  times.  As  addressed  previously,  optimization  of 
operations  is  most  effective  when  performed  at  this  level.  The  low  level  optimization  must  be 
performed  in  assembly  language.  This  optimization  requires  great  familiarity  with  the  DSP 
architecture  and  command  structure.  After  this  optimization,  the  C  language  programming  of  the 
DSP  is  similar  to  developing  conventional  programs. 

There  are  several  factors  which  affect  how  optimization  is  accomplished.  Some  of  these 
include  maximizing  operation  efficiency,  minimizing  program  size,  or  maximizing  program  speed. 
Operation  efficiency  determines  how  efficient  the  program  is  in  solving  the  user’s  problems  with 
regard  to  wall-clock  time  as  well  as  accuracy. 

The  automatic  code  optimization  problem  is  very  difficult  and  a  simple  solution  cannot  be
expected  without  the  development  of  new  techniques.  Its  solution  will  probably  use  various  AI
techniques  such  as  pattern  recognition.  However,  considerable  improvement  in  program  efficiency
can  be  achieved  if  C  code  is  structured  in  such  a  way  that  it  reflects  the  DSP  instruction  set  and
architectural  constraints*.  Although  a  DSP  C  compiler  can  do  an  adequate  job,  our  experience
shows  that  even  the  most  highly  optimized  code  can  be  improved  by  up  to  50%  by  further  hand
optimization  of  the  intermediate  assembler  code.

*  One  of  the  recommendations  made  by  AT&T  concerning  the  use  of  the  C  language  is  to  think  how  the
program  could  be  coded  in  the  assembly  language  and  then  to  write  a  program  that  maps  well  to  the  hardware
[AT&T  1988].  This  implies  that  pointer  addressing  should  be  used  instead  of  array  addressing,  wherever  possible.
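
The  recommendation  to  prefer  pointer  addressing  can  be  illustrated  with  a  dot  product,  the  basic
multiply-accumulate  loop.  The  sketch  below  is  ordinary  portable  C,  not  compiler  output  or  code
from  the  study;  the  function  name  dot_product  is  chosen  here  for  illustration.  Each  iteration
contains  one  multiply-accumulate  and  two  post-incremented  pointer  reads,  which  is  the  shape  that
autoincrementing  address  registers  are  meant  to  serve.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product written with post-incremented pointers instead of x[i], y[i].
   The loop body is a single multiply-accumulate on a running sum. */
static float dot_product(const float *x, const float *y, size_t n)
{
    float acc = 0.0f;   /* running sum, intended to stay in an accumulator */
    assert(x != NULL && y != NULL);
    while (n-- > 0)
        acc += *x++ * *y++;
    return acc;
}
```

Whether  a  given  compiler  actually  maps  this  loop  to  a  compound  instruction  has  to  be  checked  in  the
intermediate  assembler  code.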

Optimization  of  a  C  program  usually  requires  compiling  the  program  twice.  The  first 
compilation  is  used  to  determine  those  areas  where  potential  improvement  is  possible  by  rearranging 
the  C  code.  After  these  modifications  are  made,  the  second  compilation  then  leads  to  a  more 
efficient  version.  Unfortunately,  the  use  of  hand-coded  optimization  creates  new  problems  if  code 
portability  at  the  C  language  level  is  desired. 

The  optimization  of  the  low-level  DSP  routines  involves  a  number  of  different  approaches 
and  constraints.  Some  of  the  more  important  are  outlined  below. 

6.4  Computation  Speed  Optimization. 

This  section  contains  a  brief  description  of  techniques  used  in  programming  and  optimizing 
the  DSP  statistical  routines.  Since  the  conventional  high-level  program  development  process  is  well 
known,  in  this  section  we  will  concentrate  on  those  program  development  aspects  which  are  unique 
to  the  DSP  and  specifically  to  AT&T  DSP32  programming. 

Loop  recognition.  In  most  statistical  programs,  the  highest  percentage  of  computations  occur 
in  the  inner  loops.  Thus,  the  primary  concern  should  be  placed  on  inner  loop  identification  and 
optimization  to  achieve  fast  execution. 

Branching  operations.  Branching  operations  that  are  supported  by  the  DSP  include 
conditional  branching,  loop  counter  branching,  call  subroutine,  return  from  subroutine,  and  the 
unconditional  goto.  Testing  for  conditions  is  an  expensive  operation  in  a  DSP,  because  test  results 
are  not  immediately  available  due  to  the  pipelining  effects.  Thus,  the  conditional  branching  is  based 
on  test  results  obtained  four  instructions  earlier.  An  alternative  to  the  conditional  branching  is 
provided  by  the  conditional  accumulator  load  instruction  which  does  not  suffer  from  the  lengthy 
delay.  This  instruction  is  particularly  effective  in  inner  loops. 
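
The  idea  of  replacing  a  conditional  branch  by  a  conditional  load  can  be  sketched  in  C  as  a  select
expression  inside  the  inner  loop.  This  is  an  illustration  only:  array_max  is  a  hypothetical  name,
and  whether  the  select  is  translated  to  a  conditional  accumulator  load  rather  than  a  jump  depends  on
the  compiler  and  must  be  verified  in  the  generated  assembly.

```c
#include <assert.h>
#include <stddef.h>

/* Maximum of n >= 1 values. The comparison result selects a value
   instead of steering control flow around an assignment. */
static float array_max(const float *x, size_t n)
{
    float best;
    assert(x != NULL && n > 0);
    best = *x++;
    while (--n > 0) {
        float v = *x++;
        best = (v > best) ? v : best;   /* select form; no explicit if/goto */
    }
    return best;
}
```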

Pipelining  and  interleaving.  The  most 
important  constraints  are  those  imposed  by
pipelining  of  operations.  In  this  context, 
pipelining  means  that  the  results  of  the  more 
complex  floating  point  operations  may  not  be 
available  for  several  instruction  cycles.  The 
sequencing  of  operations  is  particularly 
important  if  efficiency  of  computations  is  to  be 
optimized. 

Pipelining  of  DSP  instructions  is 
illustrated  in  Figure  6.5.  To  satisfy  pipeline 
constraints,  the  programmer  must  insert  a 
number  of  "no  operations"  or  "nops"  to  comply 
with  these  restrictions.  Although  these  added  instructions  satisfy  the  pipeline  constraints,  they  have 
the  effect  of  slowing  down  the  computations. 
By  interleaving  operations,  the  efficiency  of  computations  can  be  increased.  This  involves
replacing  nop  instructions  with  other  instructions  that  are  not  dependent  on  the  current
computations.  However,  the  resulting  code  becomes  not  only  more  difficult  to  understand,  but  also
more  difficult  to  debug.

When  interleaving,  it  is  important  to  consider  several  factors  simultaneously.  These  factors 
include  the  available  instruction  cycles  assigned  to  nop  instructions,  available  registers,  and  the  set 
of  instructions  which  could  be  executed  in  a  different  sequence. 

Although  interleaving  appears  to  be  a  simple  technique,  efficient  use  requires  a  great 
familiarity  with  the  computation  algorithms,  something  not  usually  available  during  the  compilation 
process.  Thus,  interleaving  is  particularly  difficult  to  do  automatically.  The  best  approach  is  to 
develop  the  algorithm  first  without  considering  the  interleaving.  After  the  algorithm  has  been  fully 
debugged,  note  the  locations  of  all  of  the  nop  operators,  and  then  determine  which  of  the  succeeding 
instructions  could  be  moved  to  these  locations. 
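
At  the  C  level,  the  closest  analogue  of  interleaving  is  to  give  the  compiler  or  assembler  programmer
two  independent  chains  of  operations  to  alternate  between.  The  sketch  below  is  an  assumption-laden
illustration,  not  DSP32  code:  sum_interleaved  is  a  hypothetical  name,  and  the  two  partial  sums
simply  guarantee  that  the  second  addition  never  waits  for  the  result  of  the  first.

```c
#include <assert.h>
#include <stddef.h>

/* Sum with two independent accumulators. The addition into acc1 does not
   depend on the pending addition into acc0, so it can fill a slot that
   would otherwise be spent waiting. */
static float sum_interleaved(const float *x, size_t n)
{
    float acc0 = 0.0f, acc1 = 0.0f;
    size_t i;
    assert(x != NULL);
    for (i = 0; i + 1 < n; i += 2) {
        acc0 += x[i];       /* chain 0 */
        acc1 += x[i + 1];   /* chain 1, independent of chain 0 */
    }
    if (i < n)              /* odd element left over */
        acc0 += x[i];
    return acc0 + acc1;
}
```

Because  the  order  of  additions  differs  from  a  single-accumulator  loop,  the  rounded  result  may  differ
in  the  last  digits  for  general  floating  point  data.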

Register  and  accumulator  assignment.  Since  only  a  limited  number  of  registers  and
accumulators  are  available,  computation  optimization  must  consider  availability,  reachability,  and 
effectiveness. 

An  availability  chart  can  clearly  identify  those  registers  which  have  been  already  assigned  and 
which  are  available  for  use  (see  Figure  6.6).  Typically,  a  set  of  registers  is  allocated  for  system  use. 
If  these  registers  are  to  be  used,  they  must  be  saved  and  reset  after  the  operations  have  been 
completed. 

Reachability  refers  to  data  indexing.  A  location  is  easily  reachable  if  it  can  be  accessed  as 
part  of  the  normal  register  incrementing  process.  Furthermore,  data  can  be  retrieved  faster  if  the 
address  is  already  available  in  one  of  the  address  registers. 

Effectiveness  of  keeping  certain  values  or  data  in  accumulators  and  registers  depends  to  a 
great  extent  on  data  usage  and  availability  of  registers  and  accumulators.  Good  data  structure  layout 
can  greatly  improve  processing  speed.  This  is  particularly  important  when  working  with  matrices  or
other  more  complex  data  structures. 

Register  indexing  operations.  Register  indexing  requires  careful  consideration  of  the  data 
storage  layout.  The  fastest  access  will  be  obtained  if  registers  can  be  incremented  in  a  constant 
manner  as  in  pointer  incrementing. 
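
A  constant  increment  is  easy  to  arrange  when  the  storage  layout  is  known.  The  sketch  below
assumes  a  matrix  stored  row  by  row  (an  assumption  made  here,  not  a  layout  stated  in  the  report)
and  walks  one  column  by  adding  the  row  length  to  the  offset  on  every  step;  column_sum  is  a
hypothetical  name.

```c
#include <assert.h>
#include <stddef.h>

/* Sum one column of a rows x cols matrix stored row by row.
   The offset advances by the same constant (cols) on every step. */
static float column_sum(const float *m, size_t rows, size_t cols, size_t col)
{
    float acc = 0.0f;
    size_t off, r;
    assert(m != NULL && col < cols);
    off = col;                      /* first element of the column */
    for (r = 0; r < rows; r++) {
        acc += m[off];
        off += cols;                /* constant stride to the next row */
    }
    return acc;
}
```

Storing  the  matrix  column  by  column  instead  would  make  the  same  walk  a  unit-stride  access.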

Floating-point  considerations.  When  performing  floating  point  computations,  such  as  summing 
arrays,  data  should  remain  in  the  accumulator  if  possible.  This  approach  results  in  a  higher  accuracy 
because  the  accumulators  have  more  significant  digits  than  the  memory  storage.  If  intermediate  data 
saves  are  used,  this  advantage  is  lost.
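
The  effect  can  be  reproduced  on  any  machine  with  a  small  experiment.  In  the  sketch  below,  which
is  an  analogy  and  not  DSP32  code,  a  double  stands  in  for  the  wider  accumulator  and  a  volatile  float
stands  in  for  an  intermediate  save  to  single-precision  memory  on  every  step.  Both  function  names
are  hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sum kept in a wide accumulator; rounded to single precision once at the end. */
static float sum_in_accumulator(const float *x, size_t n)
{
    double acc = 0.0;   /* stand-in for the wider accumulator */
    size_t i;
    assert(x != NULL);
    for (i = 0; i < n; i++)
        acc += x[i];
    return (float)acc;
}

/* Sum saved to single-precision storage after every addition. */
static float sum_with_stores(const float *x, size_t n)
{
    volatile float acc = 0.0f;   /* volatile forces a rounded store each step */
    size_t i;
    assert(x != NULL);
    for (i = 0; i < n; i++)
        acc = acc + x[i];
    return acc;
}
```

Summing  one  million  copies  of  0.1  shows  the  difference  clearly:  the  wide  accumulator  returns  a  value
within  rounding  of  100000,  while  the  version  with  intermediate  saves  drifts  by  several  hundred.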

Subroutine  calls.  Since  the  low-level  statistical  algorithms  are  implemented  as  subroutines,
it  will  be  necessary  to  examine  how  they  can  be  best  interfaced  with  the  higher  level  languages.  For
subroutine  calls,  a  number  of  different  approaches  are  possible.  If  only  a  few  parameters  are  needed,
then  these  could  be  loaded  in  registers  or  accumulators  before  the  subroutine  is  called.  A  second
approach  could  store  the  parameters  after  the  subroutine  call.  The  return  registers  then  could  be
used  to  pick  up  the  needed  parameters.  Of  the  above  techniques,  direct  passing  of  parameters
via  registers  or  accumulators  is  the  most  efficient  from  a  computation  viewpoint.  Therefore  it  is
often  used  for  high-speed,  embedded,  real-time  applications.

The  third  approach  involves  use  of  a  call 
stack.  In  this  case  the  parameters  are  placed  on 
the  stack  before  the  subroutine  is  called.  This
approach  is  more  suitable  for  compiled 
programs.  The  overhead  incurred  with  the 
subroutine  calls  involves  parameter  passing, 
register  and  accumulator  saving  and  restoring, 
and  adjusting  the  return  register  value  for  proper 
return  from  the  subroutine.  Although  this 
overhead  could  be  eliminated  by  direct  coding, 
the  advantages  of  structured  programming  are
lost  and  more  memory  may  be  required. 


Computation  efficiency.  Computation 
efficiency  will  depend  on  the  use  of  compound 
instructions.  If  both  the  DAU  and  CAU  can 
operate  concurrently,  then  maximum  gain  in 
operating  efficiency  can  be  obtained. 


Minimizing  program  size.  In  the  past,  when  memory  was  expensive  and  limited,  much
effort  was  spent  on  reducing  program  size,  often  at  the  expense  of  increased  solution  time.
Memory  costs  are  less  of  an  issue  today.  However,  the  high-speed  memory  that  is  used  within  the
DSP  is  still  expensive  and  usually  limited  in  size.  As  a  result,  DSP  program  size  optimization  will
still  be  important.

Maximizing  program  speed.  When  working  with  computation-intensive  statistical  problems,
high  speed  is  a  major  requirement.  Although  the  use  of  the  DSP  alone  results  in  speed  improvement,
further  optimization  is  still  required  to  achieve  the  best  throughput.  Note,  however,  that  it  is
usually  impossible  to  optimize  for  both  program  size  and  speed.

Memory  access  delays.  Memory  delays  due  to  the  memory  access  wait  states  can  be  reduced 
by  separating  program  and  data  in  different  memory  banks. 

Other  constraints  and  restrictions.  In  addition  to  pipeline  delays  there  are  other  constraints
and  restrictions  which  increase  solution  time  [AT&T  1988].  Strict  adherence  to  these  rules  is
required  to  obtain  reliable  results.  Fortunately,  the  DSP  assembler  will  report  the  majority  of  the
restriction  violations.  Often,  a  simple  rearrangement  of  instructions  will  reduce  the  need  for
introducing  nop  instructions.

Figure  6.6  Register  and  accumulator  availability  chart.  Example  of  pointer  to  data.


Figure  6.7  DSP  compilation  and  data  flow  process. 


All  of  the  above  considerations  complicate  the  program  development  and  debugging  effort 
and  make  it  more  difficult  to  optimize  the  statistical  subroutines.  This  optimization  is  performed 
manually  now,  because  current  compilers  are  not  capable  of  intelligent  modification  of  statistical 
algorithms.  Section  8  presents  the  results  of  algorithm  optimization  and  C  program  development  for
several  routines,  given  the  above  outlined  approach  to  programming  and  meeting  constraints.  These
DSP  algorithms  are  compared  against  their  implementation  in  the  PC  environment  (see  Figure  6.7).
7  STATISTICS  WORKSTATION  INTERFACE


This  section  describes  the  interface  between  the  host  processor  and  the  DSP  board.  Both 
hardware  and  software  aspects  are  considered.  The  statistics  workstation  hardware  interface  includes 
the  internal  link  between  the  host  microprocessor  and  the  DSP  and  the  external  connections  to  other 
systems.  The  workstation  software  interface  includes  operating  system  calls  and  the  software  links 
between  host  and  DSP  programs. 

7.1  Software  Interface 

The  minimum  support  software  includes  DSP  compiler,  simulator,  and  custom  software 
needed  to  integrate  the  DSP  in  a  microcomputer  environment.  Additional  custom  software  is  needed
to  interface  the  DSP  to  the  graphics  display  and  to  the  operating  system. 

The  applications  program  interface  design  depends  on  the  selected  host  language  (C 
language)  and  the  DSP  compiler  characteristics.  The  selected  programming  language  prescribes  a 
statistical  function  call  interface,  which  in  turn  defines  the  lower  level  implementation.  Since  the 
statistical  functions  are  evaluated  in  the  DSP,  the  software  interface  module  must  control  the  loading 
of  statistical  function  modules. 

Since  the  statistics  workstation  software  interface  is  relatively  complex,  a  formal  description 
of  this  interface  is  needed  to  simplify  program  development.  This  description  is  usually  expressed  in
a  meta-language. 

7.1.1  Software  Description  Meta-language 

The  objective  of  the  meta-language  development  was  to  define  a  formal  interface  between 
the  main  host  and  the  DSP  programs.  By  a  formal  interface  we  mean  a  capability  similar  to  that  of 
an  Interface  Description  Language  (IDL)  [Snodgrass  1989]  or  a  hardware  description  language,  such 
as  VHSIC  Hardware  Description  Language  (VHDL)  [IEEE  1988]. 

To  begin,  we  must  define  the  software  routines  to  be  interfaced.  A  typical  mathematical 
routine  can  be  described  as  either  a  procedure  (no  return  value)  or  a  function  (return  value)  along 
with  a  set  of  arguments.  The  arguments  themselves  can  be  floating  point  or  integer  values,  single 
values  or  arrays,  pointers  to  functions,  and  combinations.  The  strength  of  a  high-level  language,  such 
as  C,  is  that  it  is  able  to  free  the  programmer  from  having  to  do  the  bookkeeping  involved  with  the 
arguments  (such  as  saving  the  registers  and  stack  location).  This  advantage  is  lost  when  dealing  with 
two  distinct  processors. 

When  developing  a  routine  for  a  host-controlled,  slave-mode  DSP  program,  the  programmer 
is  responsible  for  matching  the  arguments  between  two  different  processors,  and  controlling  the  child 
program  (see  Figure  7.1).  This  involves  loading  the  DSP  program  and  symbol  table,  finding  the  labels 
or  symbols  corresponding  to  the  arguments,  finding  the  addresses  of  these  symbols,  etc.  This  becomes 
tedious  and  prone  to  errors  unless  some  automation  tools  can  be  introduced. 
A  formal  description  in  the  form  of  a  meta-language  or  software  description  language  (SDL)
is  essential  to  simplify  automatic  program  development  and  to  improve  the  overall  program  reliability
[Wirth  1976].  An  example  of  an  automated  approach  is  the  use  of  the  STAGE2  macrogenerator
[Waite  1973]  for  generating  DSP  interface  programs*.  STAGE2  is  essentially  a  template  matching
program.

An  example  of  a  template  input  that  we  have  successfully  used  in  creating  a  C  code  PC/DSP 
interface  module,  complete  with  a  correct  argument  list,  is  given  in  Listing  7.1. 


#define  MAXELEMENTS  601
FUNCTION:ProjPursTwo(n_d,n_p,Data,x,Jj,iter,index,toler,Z);
EXEC:a.out;
SET:float  Z(n_d*n_p,MAXELEMENTS);
DOWNLOAD:int  n_d,n_p,Jj;
DOWNLOAD:float  Data(n_d*n_p,MAXELEMENTS);
DOWNLOAD:float  toler(2,2);
UPLOAD:int  iter(2,2);
UPLOAD:float  index(1,1);
DOWNUP:float  x(2*n_p,10);
START;
PROBE:Z,x,iter,index,errn;
TASK:printf("%d",errn);
TASK:plotgraph(index,x,iter,Z);
END;


Listing  7.1  Projection  pursuit  SDL. 

This  file,  together  with  the  master  template  and  the  STAGE2  program,  was  used  to  interface 
the  host  PC  program  with  the  DSP  board.  In  this  case,  the  DSP  executable  program,  called  "a.out", 
was  designed  to  run  a  2D  projection  pursuit  algorithm  given  some  initial  data  supplied  by  the  PC. 
The  STAGE2  program  generated  the  interface  software  required  for  transferring  the  program 
arguments  (data  and  control  parameters)  between  the  PC  and  DSP.  In  this  example,  a  concurrent 
task  performed  by  the  PC  is  intermediate  plotting  of  the  2D  projection  plot  as  the  DSP  is  running. 

The  strength  of  the  approach  is  that  the  formal  syntax,  similar  to  that  used  in  the  VHDL 
language  or  in  Ada  [Cohen  1986]  (e.g.  download,  upload,  downup,  are  similar  to  in,  out,  inout  in 
Ada),  eliminates  inconsistency  errors  that  could  easily  occur  with  handcoding.  Further  benefits  of  the 
formalized  description  include  easier  checking,  clearer  description,  and  reduced  debugging  effort. 
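
The  analogy  with  in,  out,  and  inout  can  be  made  concrete  with  a  small  descriptor  table.  The
sketch  below  is  not  the  code  generated  by  STAGE2;  the  type  and  function  names  (sdl_dir,  sdl_arg,
sent_at_start,  fetched_at_end,  count_args)  are  hypothetical  and  only  show  how  one  declaration  per
argument  determines  both  the  transfers  before  START  and  the  transfers  at  END.

```c
#include <assert.h>
#include <stddef.h>

/* Transfer directions, mirroring DOWNLOAD / UPLOAD / DOWNUP (in / out / inout). */
enum sdl_dir { SDL_DOWNLOAD, SDL_UPLOAD, SDL_DOWNUP };

/* Hypothetical descriptor for one argument of an interfaced routine. */
struct sdl_arg {
    const char  *name;
    enum sdl_dir dir;
};

/* True for arguments that are sent to the DSP before START. */
static int sent_at_start(enum sdl_dir d)
{
    return d == SDL_DOWNLOAD || d == SDL_DOWNUP;
}

/* True for arguments that are fetched from the DSP at END. */
static int fetched_at_end(enum sdl_dir d)
{
    return d == SDL_UPLOAD || d == SDL_DOWNUP;
}

/* Count the arguments for which pred() holds. */
static size_t count_args(const struct sdl_arg *args, size_t n,
                         int (*pred)(enum sdl_dir))
{
    size_t i, count = 0;
    assert(args != NULL && pred != NULL);
    for (i = 0; i < n; i++)
        if (pred(args[i].dir))
            count++;
    return count;
}
```

Applied  to  the  eight  transferred  arguments  of  Listing  7.1  (five  DOWNLOAD,  two  UPLOAD,  one  DOWNUP),
six  are  sent  at  the  start  and  three  are  fetched  at  the  end.  Because  both  lists  come  from  the  same
table,  they  cannot  disagree.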

The  SDL  syntax  is  contained  in  Listing  7.2.  The  argument  types  can  be  float  (float),  integer 
(int),  or  pointer  to  a  function  (function).  In  the  latter  case,  a  character  string  must  be  passed  to 
match  the  symbol  table.


*  The  AWK  language  (UNIX  utility)  is  a  similarly  structured  language  that  has  many  of  the  same  capabilities
as  STAGE2. 
#define      Any  preprocessor  directives.
FUNCTION:    The  C-level  routine  along  with  its  arguments.
EXEC:        The  name  of  the  DSP  executable  file  calling  FUNCTION.
SET:         Declares  argument  type  and  dimension  (e.g.  float  Z(number  elements,  max  dim)).
DOWNLOAD:    Declares  arguments  to  be  downloaded  to  the  DSP.
UPLOAD:      Declares  arguments  to  be  uploaded  from  the  DSP.
DOWNUP:      Declares  arguments  to  be  downloaded  and  uploaded.
START:       Downloads  arguments  and  starts  the  DSP  program  EXEC.
PROBE:       Uploads  arguments  while  DSP  is  running.
TASK:        Runs  concurrent  task  on  the  host  while  DSP  is  running.
END:         Uploads  arguments  and  returns  from  C-level  routine.

Other  syntax  statements  are  the  following.

LOOP:        Used  instead  of  START  to  call  FUNCTION  repetitively.
CONTROL:     Downloads  arguments  while  DSP  is  running.

Listing  7.2  Syntax  for  software  description  language.

By  invoking  STAGE2  on  a  file  containing  the  SDL  syntax,  a  C  module  containing  the
interface  routines  initDSP_FUNCTION()  and  FUNCTION_DSP(args)  can  be  created.

Example.  A  shorter  example  of  the  SDL  approach  allows  us  to  present  the  details  in  greater 
clarity.  In  this  example,  we  wish  to  have  the  DSP  perform  a  simple  function  call  with  one  argument. 
The  function  RunTest(x)  replaces  x  by  exp(x).  The  DSP  interface  SDL  file  is  given  in  Listing  7.3
("rundsp.stg")  while  the  C  module  containing  this  function*  is  given  in  Listing  7.4  ("runtest.c").  Note
that  the  array  size  for  x  is  dimensioned  by  x(1,1),  where  the  first  value  of  1  indicates  that  a  single
value  is  downloaded  and  the  second  value  of  1  indicates  that  a  single  floating  point  memory  location 
in  the  DSP  must  be  allocated  for  x. 


*  Due  to  conflicts  regarding  argument  types  in  the  two  compilers  in  use  (Turbo  C  for  PC  and  AT&T  for  the
DSP),  the  traditional  C  declaration  was  uniformly  used. 
FUNCTION:RunTest(x);
EXEC:a.out;
DOWNUP:float  x(1,1);
START;
END;

Listing  7.3  SDL  (named  "rundsp.stg")  to  perform  simple  function  call.


#include <math.h>

void RunTest(x)
float *x;
{
    *x = exp(*x);
}

Listing  7.4  C  function  "runtest.c".


main()
{
    float x = 1.0;

    initDSP_RunTest();
    RunTest_DSP(&x);
    RunTest(&x);
}

Listing  7.5  C  module  "runmain.c"  to  be  executed  by  host.

Figure  7.1  Host  and  DSP  executable  files.

Figure  7.2  SDL  compilation  process.


The  first  step  is  to  create  the  interface  module  by  running  STAGE2  with  the  appropriate 
template  on  "rundsp.stg".  This  creates  the  C  module  "rundsp.c"  (see  Listing  7.6). 

To  create  the  DSP  executable  module  "a.out",  the  files  "rundsp.c"  and  "runtest.c"  are  compiled
and  linked  with  the  DSP  C  utilities  (see  Figure  7.2).  To  run  the  PC  program  executing  "a.out",  the 
files  "runmain.c"  and  "rundsp.c"  are  compiled  and  linked  with  the  PC  C-language  utilities.  The 
process  is  complete  when  the  PC  program  calls  the  DSP  executable  "a.out"  during  runtime  (see 
Figure  7.1). 

By  examining  the  amount  of  code  generated  by  the  STAGE2  program  in  "rundsp.c"  (see
Listing  7.6)  and  comparing  it  to  the  formal
specification  in  "rundsp.stg",  one  can  see  the  savings
in  effort.  In  the  majority  of  the  DSP  programs 
coded,  we  have  used  this  description  language.  It 
has  considerably  reduced  development  time, 
particularly  for  the  algorithms  that  require  several 
arguments  to  be  passed.  For  example,  the 
projection  pursuit  description  of  Listing  7.1 
produced  approximately  100  lines  of  error-free 
code.  In  the  following  discussion  we  describe  more 
of  the  hardware  specifics  for  interfacing. 

7.1.2  Operating  System  Interface 

All  of  the  I/O  operations  for  data  transfer 
between  the  host  and  the  DSP  are  handled  by 
special  subroutines.  The  data  transfer  uses  I/O  port 
mapping  conventions,  and  data  is  transferred  by 
writing  or  reading  from  the  specified  ports.  No 
special  additions  to  the  operating  system  are  needed 
and  all  of  the  required  operations  are  easily  done 
with  the  presently  available  host  microprocessor  and 
DOS  commands. 

Since  the  DSP  is  capable  of  autonomous 
operation,  true  multitasking  is  possible.  By  properly 
scheduling  tasks,  a  substantial  improvement  in 
system  speed  can  be  achieved.  In  the  simplest  case, 
the  task  scheduling  can  be  handled  as  an  extension 
to  the  operating  system.  A  well  known  example  of 
this  is  the  Windows™  multitasking  environment. 

7.1.3  DSP  Program  Interface

In  their  logical  structure,  DSP  programs 
follow  the  conventional  approach.  A  careful  design 
permits  an  easy  substitution  of  a  DSP  subroutine 
for  a  conventional  one. 

DSP  program  arguments.  DSP  program 
arguments  are  determined  by  the  selected  algorithm 
and  are  defined  as  part  of  the  program  module. 
Since  there  is  a  wide  variation  in  algorithms,  a 
single  universal  argument  sequence  cannot  be  easily 
established  and  each  algorithm  must  be  considered 
separately.  This  customized  approach  creates  a 
number  of  difficulties.  First,  in  conventional 
languages  the  argument  sequence  is  important.  This  means  that  the  procedure  must  use  an  argument
sequence  identical  to  that  defined  in  the  calling  program.  Second,  even  if  default  arguments  are  used,
the  same  argument  sequence  must  be  retained*.  Although  a  variable  argument  list  is  available  in
the  C  language,  providing  good  error  checking  and  diagnostics  for  it  is  not  a  simple  task.  However,  the  use
of  the  STAGE2  SDL  can  alleviate  this.

#if defined(NO_DSP)
#include <stdlib.h>
#include <conio.h>
#include "\tc\swslib\dstruct.h"

struct DSPVAR x_DSP;
struct DSPVAR flag_DSP;
struct DSPVAR errn_DSP;

void initDSP_RunTest(void)
{
    static int trial=0;
    default_addr();
    if (!dsp_dl_exec("a.out", trial))
        exit(1);
    dsp_run();
    trial=1;
    x_DSP = find_addr("x");
    flag_DSP = find_addr("flag");
    errn_DSP = find_addr("errno");
}

int RunTest_DSP(x)
float x[];
{
    int start = 1, errn;

    /* down-uploading float */
    setfloat(1, x, &x_DSP);
    dlblock(&x_DSP);
    setint(1, &errn, &errn_DSP);
    setint(1, &start, &flag_DSP);
    dlblock(&flag_DSP);
    while(dsp_done_flag(flag_DSP.addr)){
        if(kbhit()){
            if (getch() == 'g'){
                dsp_halt();
                initDSP_RunTest();
                return(1);
            }
        }
    }
    upblock(&x_DSP);
    return(0);
}
#endif

#if !defined(NO_DSP)
#include "\tc\swslib\swsfxn.h"
float x[1];
main()
{
    WaitUntilFlag();
    ConvertDSP(1, x);
    RunTest(x);
    ConvertIEEE(1, x);
    ResetToStart();
}
#endif

Listing  7.6  STAGE2  C  code  for  "rundsp.stg".  Note  the  expanded  code  size.

Specific  parts  of  the  interface  program  control  data  transfer  and  the  DSP  computation
process.  The  specific  DSP  program  arguments  include  the  following: 

Downloaded  input  data.  For  data  downloading  we  need  to  know  data  location,  data  type,  and 
array  size.  Note  there  are  two  distinct  locations  for  problem  related  data.  One  of  these  locations 
is  in  the  host  memory,  the  other  in  the  DSP  memory. 

Results  to  be  uploaded.  Before  results  can  be  uploaded  to  the  host,  data  location,  data  type, 
and  array  size  must  be  identified  and  then  the  necessary  commands  issued.  Floating  point  numbers 
are  converted  back  to  IEEE  format  before  they  are  uploaded  to  the  host  because  this  conversion  can 
be  performed  faster  in  the  DSP. 

Control  information.  Other  parameters  passed  to  the  DSP  will  include  the  number  of 
iterations  and  other  similar  control-oriented  information.  This  is  placed  in  a  data  block  where  it  can 
be  accessed  during  normal  DSP  operation. 

7.1.4  Detailed  Explanation  of  Main  Program  Tasks 

The  main  program  residing  in  the  host  processor  must  perform  a  multitude  of  tasks  that 
directly  affect  the  host  to  DSP  interface  (see  stages  II  and  III  of  Figure  7.3).  The  most  important
of  these  tasks  are  described  below. 

Program  control.  The  host  program  controls  the  top  level  operations  performed  in  the  host 
CPU  and  in  the  DSP.  Typical  program  operations  include  reading  and  preprocessing  data, 
determining  task  sequence,  initiating  specific  tasks,  checking  task  completion,  etc. 

The  local  control  used  in  the  DSP  program  includes  iteration  control  and  other  similar  control 
operations  which  are  needed  for  the  specific  computation. 

DSP  setup.  Before  the  program  and  data  can  be  downloaded  to  the  DSP,  the  DSP  has  to  be
set  up  for  data  transfer.  This  involves  issuing  the  specific  instructions  needed  to  initialize  the  DSP
and  set  up  the  DMA  channel**.

Data  transfer  to  and  from  the  DSP  is  performed  in  the  DMA  mode  (block  transfer)  for  both
the  DSP  program  and  the  data  to  achieve  the  best  efficiency.  DMA  transfer  involves  setting  up
source  and  destination  addresses,  mode  of  transfer  and  block  length.  Once  the  setup  has  been
completed,  block  data  transfer  is  automatic.  Note  that  these  data  addresses  must  be  absolute,  not
symbolic.  Thus,  symbolic  addresses  have  to  be  replaced  by  their  absolute  components.

*  For  each  statistical  function  it  should  be  investigated  whether  the  program  arguments  can  be  expressed  as  data
structures.  Should  this  be  feasible,  then  we  could  establish  the  following  structures:  input,  output,  and  control.

**  DSP  initialization  begins  with  a  reset  instruction  and  is  followed  by  the  operating  mode  setup.


Loading DSP program and data. The DSP program for the AT&T DSP32 is contained in a COFF
(Unix-type Common Object File Format) file. This file not only contains instructions for the DSP
but also contains program and data labels and their addresses. These addresses will be needed to
determine from which DSP memory locations to load and retrieve data.

Downloading the program is a relatively simple task, because normally all of the program will be
contained in one or two data blocks. The program data locations are available in the COFF header
and the program data are contained in a COFF file section. If sufficient memory is available, then
all of the needed programs could be stored in the host memory to speed up data transfer.
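
The replacement of symbolic addresses by absolute ones before a DMA block transfer can be sketched in C as follows. This is only an illustration under stated assumptions, not the AT&T or CAC utility code: the `dsp_symbol` and `dma_request` structures and the `resolve_label` and `prepare_download` helpers are hypothetical names, and a real loader would fill the symbol table from the COFF file rather than from a hand-written array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical symbol-table entry: a label and its absolute DSP address. */
struct dsp_symbol {
    const char   *label;
    unsigned long address;
};

/* Hypothetical DMA block-transfer request; all addresses are absolute. */
struct dma_request {
    const void   *host_addr;  /* source buffer in host memory       */
    unsigned long dsp_addr;   /* absolute destination in DSP memory */
    size_t        length;     /* block length in bytes              */
};

/* Look up a label. Returns 1 and stores its address on success, else 0. */
int resolve_label(const struct dsp_symbol *table, size_t count,
                  const char *label, unsigned long *address)
{
    size_t i;
    for (i = 0; i < count; i++) {
        if (strcmp(table[i].label, label) == 0) {
            *address = table[i].address;
            return 1;
        }
    }
    return 0;
}

/* Fill a DMA request for downloading a data block to a labelled location. */
int prepare_download(const struct dsp_symbol *table, size_t count,
                     const char *label, const void *data, size_t length,
                     struct dma_request *req)
{
    unsigned long addr;
    assert(req != NULL);
    if (!resolve_label(table, count, label, &addr))
        return 0;             /* unknown label: nothing is set up */
    req->host_addr = data;
    req->dsp_addr  = addr;
    req->length    = length;
    return 1;
}
```

Because the lookup is done once per label, the resolved request can be reused for every subsequent transfer of the same block.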

Data  transfer  between  the  host  and  the  DSP  is  bidirectional  and  includes  downloading  data 
and  retrieving  results.  Faster  operation  could  be  achieved  by  blocking  all  of  the  information  as  a 
contiguous  block  with  well  defined  structure.  To  download  data  we  need  to  know  addresses  for  the 
source data location in memory and the corresponding location in the DSP (from the symbol table,
see  Appendix  F).  Direct  downloading  of  data  from  a  file  without  setting  up  detailed  data  arrays  in 
the  host  memory  could  be  further  investigated  because  this  approach  could  reduce  the  size  of  the 
host  program.  It  could  also  improve  the  speed  in  some  situations.  However,  if  sufficient  processing 
time and memory space are available in the host, then host-based data buffering is preferable. This
means  that  the  host  can  read  in  the  next  data  file,  while  the  DSP  is  processing  the  previous  data. 
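
Host-based buffering of this kind is commonly arranged as two alternating buffers. The sketch below is a minimal illustration of that idea; the `double_buffer` structure, the buffer length and the helper names are assumptions made for the example and do not come from the workstation software.

```c
#include <assert.h>

#define BUF_LEN 256  /* assumed block length for this sketch */

/* Two buffers: the host fills one while the DSP works on the other. */
struct double_buffer {
    float data[2][BUF_LEN];
    int   fill;               /* index of the buffer the host is filling */
};

/* Buffer into which the host reads the next data file. */
float *fill_buffer(struct double_buffer *db)
{
    return db->data[db->fill];
}

/* Buffer holding the block that was last handed to the DSP. */
float *dsp_buffer(struct double_buffer *db)
{
    return db->data[1 - db->fill];
}

/* Exchange the roles of the two buffers after a download has been started. */
void swap_buffers(struct double_buffer *db)
{
    assert(db->fill == 0 || db->fill == 1);
    db->fill = 1 - db->fill;
}
```

After each swap the host can begin reading the next file into the buffer returned by `fill_buffer` without disturbing the block the DSP is still using.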

Starting  the  DSP  program.  DSP  program  execution  starts  from  memory  location  0  after  a  reset 
signal  has  been  issued.  This  means  that  the  program  at  location  0  should  contain  the  necessary  logic 
to  select  the  specific  program  blocks.  If  the  program  is  restarted,  then  the  same  conditions  will  apply. 
However,  it  is  also  possible  to  do  a  software  controlled  program  restart,  which  could  bypass  some  of 
the  initialization  steps. 

Concurrent processing. The host processor can perform other tasks while waiting for DSP
computation  completion.  For  example,  the  host  could  perform  data  preprocessing,  file  updating,  or 
other  tasks  which  are  not  directly  affected  by  the  expected  results. 

Check for computation completion. The check of DSP status involves testing of the
completion flag condition. Note that continuous checking is not needed in this case, because
computation results do not have to be retrieved immediately. In this respect DSP operation differs
considerably from other peripheral devices, such as communication devices, where data may be lost
if they are not retrieved immediately.


Figure 7.3 Stages of DSP program development and execution.

Upload  results.  The  setting  of  the  DSP  completion  flag  indicates  that  all  of  the  computations 
have  been  completed  and  that  the  results  are  available  for  uploading  to  the  host  (this  is  similar  to 
downloading  excepting  the  data  direction).  The  symbol  table  contains  the  needed  addresses,  and  the 
data  block  size  is  available  from  the  host  program. 

Halt  DSP  operation.  If  the  computations  must  be  aborted  and  results  retrieved,  the  DSP  can 
be  suspended  by  issuing  a  halt  command. 
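
The start, concurrent-processing, completion-check and halt tasks described above can be sketched with a simulated DSP. Everything in this example is hypothetical: the `sim_dsp` structure stands in for the real board status, and `dsp_tick` exists only because the simulation has to advance the DSP by hand, whereas the real processor runs independently of the host.

```c
#include <assert.h>

/* Simulated DSP status, standing in for the real board registers. */
struct sim_dsp {
    int running;    /* 1 while a program is executing          */
    int done_flag;  /* completion flag tested by the host      */
    int work_left;  /* remaining simulated units of DSP work   */
};

/* Start a DSP program that needs work_units units of simulated work. */
void dsp_start(struct sim_dsp *d, int work_units)
{
    assert(work_units > 0);
    d->running   = 1;
    d->done_flag = 0;
    d->work_left = work_units;
}

/* Simulation only: advance the DSP by one unit of work. */
void dsp_tick(struct sim_dsp *d)
{
    if (d->running && --d->work_left == 0) {
        d->running   = 0;
        d->done_flag = 1;   /* results are now ready for uploading */
    }
}

/* Abort the computation; the completion flag is left unset. */
void dsp_halt(struct sim_dsp *d)
{
    d->running = 0;
}

/* Do host work in slices, testing the completion flag only between slices.
   Returns the number of host work units performed while waiting. */
int host_work_until_done(struct sim_dsp *d, int check_interval)
{
    int host_units = 0;
    int i;
    assert(check_interval > 0);
    while (!d->done_flag && d->running) {
        for (i = 0; i < check_interval; i++) {
            host_units++;   /* placeholder for preprocessing or file updates */
            dsp_tick(d);
        }
    }
    return host_units;
}
```

The flag is tested once per slice rather than continuously, which reflects the point that results do not have to be retrieved the moment they become available.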

7.2  Hardware  Interface 


7.2.1  Processor  Interface 

The  main  system  bus  connects  the  host  processor  and  the  DSP  board.  System  throughput  can 
be  improved  by  increasing  the  system  clock  speed  or  by  using  a  faster  data  bus.  All  of  the  I/O 
transfer between the host and the DSP is via the host bus using either byte or 16-bit word format.
Since the next-generation DSPs, such as the Motorola 96002, will support a 32-bit bus, external data
transfer will be greatly improved*.

7.2.2  DSP  Interface  to  External  World 

In  addition  to  the  host  interface,  the  DSP  supports  a  built-in  external  serial  port  capable  of 
supporting  communications  between  processors  in  a  multiprocessor  environment  or  accepting  data 
in  a  real-time  data  collection  mode.  The  latter  type  of  application  would  be  highly  suitable  for  use 
in  statistical  process  control.  The  commercially  available  DSP  boards  that  have  a  serial  bus  typically 
use  this  interface  for  audio  (including  telephone,  speech,  etc.)  processing  applications  [Gorin  1986]. 

7.2.3 Graphics Interface

The  host  system  bus  is  also  used  to  interface  the  graphics  display  controller.  Since  the 
conventional  graphics  boards  do  not  support  higher  level  graphics  operations,  the  coordinate 
transformations,  display  scaling,  and  data  conversion  (floating-point  to  integer)  must  be  performed 
by  either  the  host  processor  or  the  DSP.  Since  many  of  these  operations  can  be  done  efficiently  in 
a  DSP,  the  use  of  a  DSP  instead  of  a  high-performance  graphics  coprocessor  can  reduce  the  overall 
system  cost.  By  performing  these  operations  in  the  DSP,  a  speed  advantage  is  gained  because  of  the 
unique  operations  available.  For  example,  the  DSP  can  provide  floating  point  to  integer  conversion 
in  a  single  instruction  cycle.  The  data  transfer  rate  can  also  be  increased  because  it  takes  less  time 
to  transfer  a  fixed-point  number  than  a  floating-point  number  directly  to  the  graphics  processor. 
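
The display scaling and floating-point to integer conversion mentioned above can be sketched in C. This is a host-side illustration of the arithmetic only, assuming a simple linear mapping onto a pixel axis; the function names and the clamping policy are choices made for the example, and on the DSP the conversion step would be a single instruction.

```c
#include <assert.h>

/* Map v from [vmin, vmax] linearly onto an integer pixel in [0, pixels-1].
   Values outside the range are clamped to the nearest edge. */
int to_pixel(float v, float vmin, float vmax, int pixels)
{
    float t;
    assert(vmax > vmin && pixels > 1);
    t = (v - vmin) / (vmax - vmin);
    if (t < 0.0f) t = 0.0f;
    if (t > 1.0f) t = 1.0f;
    /* Round to nearest by adding one half before truncation. */
    return (int)(t * (float)(pixels - 1) + 0.5f);
}

/* Convert a block of floating-point results to 16-bit display coordinates. */
void scale_for_display(const float *in, short *out, int n,
                       float vmin, float vmax, int pixels)
{
    int i;
    for (i = 0; i < n; i++)
        out[i] = (short)to_pixel(in[i], vmin, vmax, pixels);
}
```

Sending the resulting 16-bit values instead of 32-bit floating-point numbers halves the amount of data that has to cross the host bus.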

An  alternate  approach  uses  a  separate  graphics  processor  to  handle  the  statistics  workstation 
display  needs,  as  shown  in  Figure  7.4. 


* Note that the internal data buses already support 32-bit transfers.
7.2.4 Interrupts


Although the DSP is capable of handling interrupts, this capability is not normally needed in the
statistics workstation operation, except where real-time operation is desired and certain data must
be processed immediately. The DSP is particularly well suited to these applications because it is
capable of supporting very fast task switching (in the microsecond range).

In the statistics workstation we can distinguish two different types of interrupts:

Host-initiated interrupts. Most PC-based systems are capable of supporting fairly sophisticated
interrupt structures. Many of these interrupt structures are assigned to specific DOS tasks, such as
disk drivers or printers. In most PC systems, hardware is available to handle extra user-defined
interrupts. These interrupts could be used to perform real-time data collection and processing.

DSP-initiated interrupts. DSP-initiated interrupts could be used to signal either task
completion or the occurrence of some error conditions.

7.2.5  Multiprocessor  DSP  Systems 

Most commercial DSPs, such as the AT&T DSP32, are suitable for use in a multiprocessor
environment and their interfacing does not present any special hardware-related problems.


Figure  7.4  Graphics  interface. 
8 PERFORMANCE BENCHMARKING


8.1  DSP  Benchmarking 

Performance  evaluation  of  the  DSP-based  statistical  algorithms  included  analytical  and 
computer  simulation  studies,  as  well  as  experiments  on  working  prototypes.  The  evaluation  included 
computation  timing  and  memory  size  requirements.  In  particular,  the  performance  of  the  lower  level 
algorithm  implementation  was  evaluated  in  detail. 

Relatively  few  timing  benchmarks  exist  for  statistical  problems.  Many  of  these  benchmarks 
are  based  on  data  sets  where  either  accuracy  is  important  [Wetherill  19SS]  or  correctly  classifying  data 
is  important  [Jain  1987].  Linear  algebra  benchmarks,  such  as  LINPACK,  are  only  partially  applicable 
to  statistical  problems.  Generic  benchmarks  such  as  Whetstones  and  Dhrystones  [Wilson  19^,  Price 
1989],  and  the  SPEC  benchmark  [Uniejewski  1989]  are  designed  to  measure  processor  speed. 
Therefore,  a  number  of  standard  statistical  data  sets  have  been  selected  as  a  basis  for  benchmarking. 
These  are  used  to  compare  the  proposed  statistics  workstation  design  against  conventional 
implementations. 

Where  appropriate,  an  analytical  or  simulation  approach  was  used.  Most  of  this  work 
involved  using  the  DSP  simulator,  because  it  provided  detailed  timing  information.  The  computation 
speed  estimates  included  algorithm  setup  time,  pipelining  constraints,  and  memory  access  time.  The 
low-level  timing  results  are  found  in  Appendix  E.  The  accuracy  evaluation  was  based  more  on  the 
experimental  work  and  comparison  to  conventional  microprocessors.  Since  the  most  accurate  overall 
performance  evaluation  can  be  conducted  only  in  the  full  operating  environment,  the  actual  DSP  and 
its operating environment were used.

As  the  results  are  hardware  and  software  dependent,  we  first  give  a  short  description  of  the 
hardware  and  software  systems. 

8.1.1  Hardware  Configuration  and  Characteristics 

Most  of  the  Phase  I  development  was  done  on  an  IBM  compatible  386-type  personal 
computer  using  a  Micronics  motherboard  operating  at  20  MHz.  An  8  MHz  287  coprocessor  was  used 
for  floating  point  calculations.  Static  column  memory  (80  ns  access  time)  was  used  to  achieve  zero- 
state  delay. 

The  DSP  board  used  was  a  CAC  16  MHz  board  with  AT&T  DSP32  processor.  The  board 
selection  was  based  on  the  lowest  cost  and  the  easy  availability  of  the  support  software. 

8.1.2  Operating  System  and  Support  Software 

All  of  the  benchmarking  was  performed  under  the  Microsoft  MS-DOS  4.01  operating  system. 
Host  processor  software  was  developed  using  Borland’s  Turbo  C,  Version  2.0,  compiler.  The  DSP 
software  was  developed  using  the  AT&T  assembler  and  C  compiler  (and  Turbo  C  for  initial 
debugging). In addition, the AT&T and CAC DSP32 utility programs and libraries [AT&T 1988]
were used.

8.1.3 Benchmark Timing

When conducting the benchmarking, most of the timing was performed on the application
level. For the selected benchmark problems, two different programs were developed. One of these
programs used only the PC, while the other was developed to perform the computation-intensive
tasks in the DSP. In most of the cases, the same C modules were used to do the computation in
each program; only the compilation and interfacing differed. The timing comparison was based on
the actual run times of the two programs, using the same data file (see Figure 8.1).

For finer timing it will be necessary to identify the specific phases of the problem and the
time required for each of these phases. This information enables further improvements
by pinpointing the most time-consuming tasks*.

The specific time functions include (cf. Figure 6.1):

(1) User setup. The initial setup time is determined by the user-selected options. This time
involves reading the data files and choosing computing options. It is estimated that 95% of the total
time is devoted to user input, and therefore speed of computation may not be a
primary issue [Fridlund 1990] during data setup. On this basis, 5% of the total time is required for
the actual statistical computation. However, as the computation-intensive statistics take 100 or
more times as long to complete as the traditional methods, speed of computation
becomes more important.

(2) Data and program setup time. This setup time includes the initial computational setup time,
such as initialization of arrays, preliminary computations, and file initialization. The program
downloading to the DSP is handled by a utility program. Since the DSP object files are in a COFF
format, the utility program must extract the binary code and then download it to the proper
location**. The time it takes to do this is similar to the time for loading a program from DOS. This
includes some overhead plus time that will be proportional to program size. Before downloading
begins, the absolute memory locations for data transfer are obtained from the symbolic data labels
in the COFF file. This typically will only have to be done once for a given algorithm.


Figure 8.1 Data flow for host and DSP program.


* Functional partition by tasks is particularly important to properly evaluate the statistics workstation
performance. The task partitioning must also consider concurrent operation of DSP and host because the most
effective mode of operation occurs if overall processing time can be reduced.

(3)  Consistency  checks  on  data  (Host  PC).  A  consistency  check  of  the  data  is  performed  in  the 
host.  This  operation  is  completed  before  the  actual  processing  begins. 

(4) Downloading of data (PC -> DSP). Downloading of the data depends both on the data
volume and the data transfer speed. Of these, only the data volume can be controlled. The data
transfer speed is determined by the hardware, which includes data bus width, clock speed, and the
specific machine instructions. The data transfer is also handled by the utility program. Again, all of
the needed symbolic information can be extracted from the COFF file before this is done.

(5) DSP computation time. Before the DSP can begin actual computations, floating point
conversion is needed between the IEEE format used in the PC and the internal floating point format
used in the DSP (see Appendix C). A substantial speed difference exists between DSP32 and
DSP32C because the latter can perform the conversion in a single instruction cycle. We can expect
that the need for this conversion will be eliminated in future generation DSPs because these
processors are being designed to use the IEEE-type floating point representation [Motorola 1990].
The DSP computation time starts when program and needed data have been downloaded and ends
when all of the DSP calculations have been completed and data converted back to IEEE format.

(6)  Host  computation  time.  This  time  includes  statistical  computations  performed  by  the  host 
processor. 

(7) Uploading data (DSP -> PC). In data uploading, most of the same considerations apply that
were  discussed  in  connection  with  data  downloading.  However,  there  are  a  number  of  operations  that 
could  be  conducted  in  the  DSP  to  improve  the  overall  speed.  For  example,  if  the  output  data  is 
meant  for  display,  then  the  data  scaling  and  conversion  to  the  integer  format  can  be  performed  faster 
in  the  DSP  than  in  the  host.  Data  transmission  requirements  could  also  be  reduced  if  the 
experimental  data  can  be  expressed  as  16-bit  integers. 

(8) Graphics (PC -> Monitor) or (DSP -> Monitor). In the initial design, all of the graphics display
operations  are  handled  by  the  host.  It  is,  however,  possible  to  use  a  different  workstation 
architecture  in  which  the  display  subsystem  is  driven  directly  by  the  DSP.  This  approach  will  improve 
the  display  speed. 

8.1.4  Overhead  Evaluation 

For all the overhead factors listed above, the DSP computation must compensate by being
the bottleneck (i.e., $T_{DSP} > T_{downloading}$). To achieve the greatest improvement in processing speed, it
is essential to reduce the overhead as much as possible. We have determined that very little cost is
associated with these factors for our test cases.
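
The role of overhead in the overall speedup can be made concrete with a small timing model. The sketch below is an assumption-laden illustration: the `stage_times` structure simply groups the DSP-side stages of Section 8.1.3 into overhead and computation, and any numbers fed to it are hypothetical rather than measured values from this report.

```c
#include <assert.h>

/* Times in seconds for the DSP-side stages of one benchmark run. */
struct stage_times {
    double setup;        /* data and program setup      */
    double check;        /* consistency checks on data  */
    double download;     /* host to DSP data transfer   */
    double dsp_compute;  /* DSP computation             */
    double upload;       /* DSP to host result transfer */
};

/* Everything except the DSP computation counts as overhead here. */
double overhead_time(const struct stage_times *t)
{
    return t->setup + t->check + t->download + t->upload;
}

/* Total elapsed time when no stages overlap. */
double dsp_total_time(const struct stage_times *t)
{
    return overhead_time(t) + t->dsp_compute;
}

/* True when the DSP computation, not the overhead, dominates the run. */
int dsp_is_bottleneck(const struct stage_times *t)
{
    return t->dsp_compute > overhead_time(t);
}

/* Speedup of the DSP run over a host-only run taking host_time seconds. */
double speedup(double host_time, const struct stage_times *t)
{
    assert(dsp_total_time(t) > 0.0);
    return host_time / dsp_total_time(t);
}
```

In this model the achievable speedup falls as the overhead grows relative to the computation, which is why the overhead has to be kept small.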


that  all  of  the  needed  information  is  available  in  this  file,  such  as  absolute  locations  and  block  size. 
8.2 Low-level Performance


8.2.1  Optimized  vs.  Compiled 

To  evaluate  the  performance  of  the  optimized  DSP  library  routines,  comparisons  were  made 
between  these  routines  and  the  code  generated  by  the  DSP  C  compiler.  Table  8.1  gives  an  overall 
view of how the different implementations of the MAC routine (see Section 5) compare.


                          Optimized     Compiled using    Compiled using
                          Code          Pointers          Arrays

Code Size                 20 bytes      34 bytes          37 bytes

Number of Instructions    2N + 18       11N + 29          26N + 17

Table 8.1 MAC routine implementation and performance.

The  two  compiled  versions  given  in  the  table  were  coded  in  an  attempt  to  obtain  an  optimized 
compiled  version  of  DSP  code.  These  routines  do  not  include  the  variable  array  incrementing 
capability. DSP source code for the optimized MAC instruction can be found in Appendix A.

Table  8.1  shows  that  considerable  improvement  can  be  obtained  through  optimization  of  these 
small routines. In most cases it was found that the DSP C compiler creates approximately 1½ to 3
times more code than the optimized routines. This extra code usually results from added overhead
and the inclusion of precautionary nops.

The  best  test  of  performance  is  comparing  the  number  of  instructions  that  will  actually  be 
executed when the routine is called. When the loop variant N is large (i.e., N > 100), the optimized
routine will execute more than 5 times faster than the routine using pointers, and approximately 12
times faster than the routine using array indices.

In  addition,  the  optimized  routine  is  more  powerful  than  the  other  two  routines  given  because 
it  provides  a  variable  address  incrementing  capability  for  all  three  arrays,  as  opposed  to  a  constant 
increment  of  one  in  the  compiled  routines.  By  compiling  the  C  code  given  in  Appendix  A  for  the 
MAC  routine,  we  can  include  such  options.  The  result  is  that  the  DSP  C  compiler  creates  even  more 
overhead.  The  code  size  is  approximately  3  times  greater  than  the  optimized  code,  while  the 
execution  time  is  more  than  18  times  greater  for  N  >  100. 
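
The ratios quoted for the three MAC implementations follow directly from the instruction counts. The short C sketch below evaluates them, reading the Table 8.1 entries as 2N + 18, 11N + 29 and 26N + 17; the function names are chosen for this example only.

```c
#include <assert.h>

/* Instruction counts for the MAC routine as read from Table 8.1. */
long mac_optimized(long n) { return 2 * n + 18; }
long mac_pointers(long n)  { return 11 * n + 29; }
long mac_arrays(long n)    { return 26 * n + 17; }

/* Ratio of two instruction counts, i.e. the relative execution time
   when every instruction is assumed to take one instruction cycle. */
double count_ratio(long slower, long faster)
{
    assert(faster > 0);
    return (double)slower / (double)faster;
}
```

For N = 100 the counts are 218, 1129 and 2617 instructions, giving ratios of about 5.2 and 12.0. As N grows, the constant terms matter less and the ratios approach 11/2 and 26/2.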

FLOPS.  The  number  of  floating-point  operations  per  second  (FLOPS)  is  used  quite  often 
in  evaluation  of  computing  performance.  Table  8.2  shows  the  peak  performance  of  two  widely  used 
BLAS routines. These peak values will only occur for very large N; more realistic values are slightly
less than those given in the table.
                        SAXPY       SDOT
                        (MFLOPS)    (MFLOPS)

DSP32 - waits           2.67        3.2

DSP32 - no waits        4           4

DSP32C - waits          12.5        16.67

DSP32C - no waits       25          25

Table 8.2 SAXPY and SDOT performance.

These results are similar to those given for the same BLAS routines in [Harrod 1987]. The
results  show  that  DSP  performance  approaches  the  performance  of  mini-supercomputers.  These 
performance  results  can  only  be  obtained  by  eliminating  wait  states,  and  using  optimized  routines. 
Once  compiled  DSP  code  is  introduced,  the  performance  measurements  degrade,  but  are  still  much 
better  than  conventional  microprocessors. 
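
For reference, the two BLAS operations benchmarked above can be written in portable C as follows. These are plain reference versions for illustration, not the optimized DSP library routines; they assume non-negative increments, whereas the standard BLAS interface also defines behavior for negative ones.

```c
#include <assert.h>

/* SAXPY: y := a*x + y, with separate element increments for x and y. */
void saxpy(int n, float a, const float *x, int incx, float *y, int incy)
{
    int i;
    assert(n >= 0 && incx >= 0 && incy >= 0);
    for (i = 0; i < n; i++)
        y[i * incy] += a * x[i * incx];
}

/* SDOT: returns the dot product of x and y. Each loop pass is one
   multiply-accumulate, the operation the DSP performs per instruction. */
float sdot(int n, const float *x, int incx, const float *y, int incy)
{
    float sum = 0.0f;
    int i;
    assert(n >= 0 && incx >= 0 && incy >= 0);
    for (i = 0; i < n; i++)
        sum += x[i * incx] * y[i * incy];
    return sum;
}
```

Both loops perform one multiplication and one addition per element, which is why their peak rates in Table 8.2 are similar.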

Floating-point  instructions.  The  major  advantage  the  DSP  has  over  conventional 
microprocessors  is  that  it  can  execute  a  floating-point  instruction  in  one  instruction  cycle,  equivalent 
to  4  clock  cycles.  Even  by  including  a  math  coprocessor,  conventional  microprocessors  still  require 
considerably more clock cycles to complete a floating-point instruction. For example, the multiply-accumulate
instruction in the DSP32 operating at 16 MHz will only take 4 clock cycles (6 if wait states
are included), while it will require approximately 1000 equivalent 386-processor clock cycles for a
386/287 combination running at 20 MHz*.


* Cycle time for the benchmarking system was in equivalent 386 (20 MHz) clocks since the 287 coprocessor
was running at 8 MHz.
Turbo Debugger Log

CPU 80386

  cs:1922 8B5EFA        mov    bx,[bp-06]
  cs:1925 D1E3          shl    bx,1
  cs:1927 D1E3          shl    bx,1
  cs:1929 035E0A        add    bx,[bp+0A]
  cs:192C CD3507        fld    dword ptr[bx]
  cs:192F 8BDF          mov    bx,di
  cs:1931 D1E3          shl    bx,1
  cs:1933 D1E3          shl    bx,1
  cs:1935 035E06        add    bx,[bp+06]
  cs:1938 CD3507        fld    dword ptr[bx]
  cs:193B CD3AC9        fmulp  st(1),st
  cs:193E CD3546FC      fld    dword ptr[bp-04]
  cs:1942 CD3AC1        faddp  st(1),st
  cs:1945 CD355EFC      fstp   dword ptr[bp-04]
  cs:1949 CD3D          fwait
  cs:194B 037E08        add    di,[bp+08]
  cs:194E 8B460C        mov    ax,[bp+0C]
  cs:1951 0146FA        add    [bp-06],ax
  cs:1954 46            inc    si
  cs:1955 3B7604        cmp    si,[bp+04]
  cs:1958 7CC8          jl     1922

Listing 8.1 SDOT inner loop, 80x86 code.


sdotl:  if (r3-- >= 0) goto sdotl
        a0 = a0 + *r2++r16 * *r1++r17

Listing 8.2 SDOT inner loop, DSP code.


Listings 8.1 and 8.2 compare the 386/287 assembly code* for the main loop of SDOT with
the DSP assembly code**. In this example it is easy to see that the number of assembly instructions
is much smaller in the DSP. Furthermore, the DSP register transfer language style appears much more
readable than the 386/287 mnemonics.

From the reference manuals for the 386 and 287 processors, we find that it takes
approximately  1062  clock  cycles  to  complete  one  loop  of  the  SDOT  routine.  In  contrast,  the  DSP32 
will  take  8-10  clock  cycles  to  complete  the  loop,  while  the  DSP32C  will  only  take  4-6  clock  cycles. 
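
These cycle counts can be converted to elapsed time per loop pass with a one-line calculation. The sketch below assumes the clock rates of the benchmarking system described in Section 8.1.1 (a 20 MHz host and a 16 MHz DSP32); the function name is chosen for this example.

```c
#include <assert.h>

/* Elapsed time in microseconds for a number of clock cycles at a given
   clock frequency in MHz (one cycle at 1 MHz lasts one microsecond). */
double cycles_to_us(double cycles, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return cycles / clock_mhz;
}
```

With these figures, one pass of the SDOT loop takes about 53.1 microseconds on the 386/287 (1062 cycles at 20 MHz) and between 0.5 and 0.625 microseconds on the DSP32 (8 to 10 cycles at 16 MHz). That is a ratio of roughly 85 to 106 for the inner loop alone.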

A similar performance improvement occurs when trying to optimize the C code for Horner's
algorithm (see Appendix A). Listings 8.3 through 8.6 demonstrate the evolution of the DSP code
optimization process and give a comparison against the PC 80x86 assembly language code. Listing
8.3 shows the optimized C code for the algorithm that relies on pointer addressing. Listing 8.4 shows
the DSP code compiled from Listing 8.3. The much more compact Listing 8.5 gives the hand-optimized
DSP code. As a comparison, Listing 8.6 gives the substantial 80x86 code compiled from
the C code of Listing 8.3.


* A 386 C compiler was not available during this stage, so the 8086/8087 compilation mode was used.
We do not expect much of a difference in either mode.

** Note that in Figure 8.5, the DSP automatically does the next instruction after encountering a conditional
branch.
float HORN ( N, COEF, X )
int N;
float COEF[], X;
{
    register int i;
    register float horn, *coef, x;

    coef = COEF;
    x = X;
    horn = *coef++;
    i = N - 3;
    do
        horn = *coef++ + horn * x;
    while (i-- >= 0);
    return(horn);
}


Listing 8.3 Horner's C code.
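
For readers more used to prototype-style C, the same recurrence can be written as below. This is an illustrative restatement, not the code that was compiled for the benchmarks; it assumes, as Listing 8.3 appears to, that the first array element is the highest-order coefficient, and unlike the do-while form it also handles a single coefficient.

```c
#include <assert.h>

/* Evaluate a polynomial by Horner's rule.
   coef[0] is the highest-order coefficient and n is the number of
   coefficients, so the polynomial has degree n - 1. */
float horner(int n, const float *coef, float x)
{
    float acc;
    int i;
    assert(n >= 1);
    acc = coef[0];
    for (i = 1; i < n; i++)
        acc = coef[i] + acc * x;   /* one multiply-accumulate per term */
    return acc;
}
```

Each pass of the loop is a single multiply-accumulate, which is the operation the DSP inner loop of Listing 8.5 performs in one instruction.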


.global HORN
HORN:
        *r14++ = r13
        *r14++ = r12
        *r14++ = a2 = a2
        *r14++ = a3 = a3
        nop
        r12 = r14 - 20
        r12 = *r12
        r1 = r14 - 24
        a2 = *r1
        a3 = *r12++
        r1 = r14 - 16
        r1 = *r1
        nop
        r13 = r1 - 3
L15:
        a3 = *r12++ + a3 * a2
        nop
L14:
        if (r13-- >= 0) goto L15
        nop
L13:
        a0 = a3
        goto L12
        nop
L12:
        r14 = r14 - 12
        r13 = *r14++
        r12 = *r14++
        a2 = *r14++
        a3 = *r14++
        return (r18)
        r14 = r14 - 12


Listing 8.4 Compiled DSP code.


.global HORN
HORN:
        r14 = r14 - 12
        a1 = *r14++r19          /* X input */
        r3 = *r14++r19          /* COEF */
        r2 = *r14++r19          /* N */
        a0 = *r3++
        r2 = r2 - 3
horni:
        nop
        if (r2-- >= 0) goto horni
        a0 = *r3++ + a0 * a1
horno:
        return (r18)
        nop

Listing 8.5 Optimized DSP code for Horner's algorithm.


HORN: float HORN ( N, COEF, X )
  cs:01FA 55            push   bp
  cs:01FB 8BEC          mov    bp,sp
  cs:01FD 83EC08        sub    sp,0008
  cs:0200 56            push   si
  cs:0201 57            push   di
  cs:0202 CD394608      fld    qword ptr[bp+08]
  cs:0206 CD355E08      fstp   dword ptr[bp+08]
  cs:020A CD3D          fwait
HORN#8: coef = COEF;
  cs:020C 8B7606        mov    si,[bp+06]
HORN#9: x = X;
  cs:020F 8B560A        mov    dx,[bp+0A]
  cs:0212 8B4608        mov    ax,[bp+08]
  cs:0215 8956FE        mov    [bp-02],dx
  cs:0218 8946FC        mov    [bp-04],ax
HORN#10: horn = *coef++;
  cs:021B 8B5402        mov    dx,[si+02]
  cs:021E 8B04          mov    ax,[si]
  cs:0220 8956FA        mov    [bp-06],dx
  cs:0223 8946F8        mov    [bp-08],ax
  cs:0226 83C604        add    si,0004
HORN#11: i = N - 3;
  cs:0229 8B7E04        mov    di,[bp+04]
  cs:022C 83C7FD        add    di,FFFD
HORN#13: horn = *coef++ + horn * x;
  cs:022F CD3546F8      fld    dword ptr[bp-08]
  cs:0233 CD3546FC      fld    dword ptr[bp-04]
  cs:0237 CD3AC9        fmulp  st(1),st
  cs:023A CD3504        fld    dword ptr[si]
  cs:023D CD3AC1        faddp  st(1),st
  cs:0240 CD355EF8      fstp   dword ptr[bp-08]
  cs:0244 CD3D          fwait
  cs:0246 83C604        add    si,0004
HORN#14: while (i-- >= 0);
  cs:0249 8BC7          mov    ax,di
  cs:024B 4F            dec    di
  cs:024C 0BC0          or     ax,ax
  cs:024E 7DDF          jnl    HORN#13 (022F)
HORN#15: return(horn);
  cs:0250 CD3546F8      fld    dword ptr[bp-08]
  cs:0254 EB00          jmp    HORN#16 (0256)
HORN#16: }


Listing 8.6 80x86 code for Horner's algorithm.
8.2.2 Summary of the Statistics Workstation Low-Level Routine Performance

Appendix E and Table E.1 give information on the execution of the low level BLAS and
BSAS  routines  provided  in  the  statistics  workstation  library.  The  code  for  these  routines  has  been 
optimized  to  provide  the  best  possible  execution  times.  The  information  in  the  table  was  formulated 
from  DSP  source  code  and  by  using  the  DSP  simulator,  which  provided  a  profile  of  the  code. 

8.3  Results  of  Computation-intensive  Algorithm  Comparisons 

The test cases described below were chosen to be a representative sampling of computation-intensive
statistical methods. Most of the modules were coded in the C language for maximum
portability  between  the  PC  and  DSP.  As  the  C  compiler  was  not  available  during  the  early  course 
of  this  effort,  several  of  the  examples  were  written  in  DSP  assembly  language. 

In  general,  we  observed  that  hand-coding  in  DSP  assembly  language  and  using  BLAS  and 
BSAS  routines  within  a  C  environment  produced  similar  performance  figures.  These  were  usually 
well  above  the  performance  of  the  strictly  C  written  routines.  However,  properly  written  C  routines, 
using  incrementing  pointers  and  other  methods,  were  able  to  routinely  improve  the  performance  by 
50%. 


The timing for each case was compared against a PC with and without a coprocessor. Timing steps 4,
5, and 7 in Section 8.1.3 were used in measuring DSP performance while step 6 alone was used in
gauging PC performance. Figure 9.4 shows the performance improvement of the various algorithms.
The BSAS and BLAS routines used for several algorithms are given in Table 4.1. As discussed
further in Section 9, including the low-level routines improved performance greatly.


8.3.1 Correlation coefficient (bootstrapped).
program CC

This  test  case  was  based  on  the  example  by  Diaconis  and  Efron  [Diaconis  1983]  for  illustrating 
the  bootstrap  technique  on  a  relatively  simple  statistic,  the  correlation  coefficient  in  Equation  8.1. 

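$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} \qquad \text{Eq.(8.1)}
$$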


For testing purposes, the data set of the above reference (GPA and SAT scores from 15
students being admitted to various law schools) was used. By today's standards of PC computing
power, calculating several hundred bootstraps from this particular sample is not too formidable a task
[Noreen 1989]. However, as the number of parameters or cases increases beyond this level, the
computation time will increase correspondingly. For this reason, the development on a DSP was
deemed worthwhile.
Since this routine was coded in DSP assembly language, we expected the maximum speedup
over the PC version. The random number generation was provided by the ran routine in the AT&T
library.


Figure 8.2 Correlation coefficient bootstrap display.


For the law school data set, we introduced a third, random variable against which to compare the
correlation coefficient. The results of a simulation of 30,000 bootstraps on this sample are shown in
Figure 8.2. The GPA is designated A, the SAT score is B, and the random number is C. We expect high
correlation between A and B, but not between A and C or B and C.

The  timing  for  100  bootstraps  for  both  the  PC  and  DSP  is  shown  in  Table  8.3.  As  can  be 
observed,  the  performance  improvement  over  the  coprocessor  configured  PC  is  approximately  30 
times.  For  larger  data  sets,  the  improvement  is  more  substantial,  increasing  to  35  times  for  a  set  of 
40  cases.  This  is  due  to  fewer  calculations  of  the  square  root  compared  to  multiply  and  accumulates. 


386         386/287       DSP32

5 s         1.6 s         0.052 s

Table 8.3 Timing for bootstrapped correlation coefficient.


8.3.2 Multiple linear regression using SVD (bootstrapped).
program SVD

Singular value decomposition (SVD) is a powerful and widely used technique for solving least
squares problems. Its power stems from the ability to produce solutions for cases when equations are
very close to singular. For an overdetermined system, SVD produces the best approximation in the
least squares sense, while in the underdetermined case it produces the smallest values in the least
squares sense.

Two sources were referenced when creating the SVD routine: the LINPACK
User’s Guide [Dongarra 1979] and Press [1986]. These sources contain SVD routines
coded in Fortran and C, respectively. The two versions are similar in some respects; however,
the LINPACK version is much more complex because of the variety of decomposition options given
to the user^.

The routine for singular value decomposition given in [Press 1986] performs decomposition
on any MxN matrix, where M is greater than or equal to N. The routine decomposes the matrix into
three matrices, returning two orthogonal matrices along with one diagonal matrix. This routine is
much smaller and simpler than LINPACK’s SVD; in addition, it was provided
in C code. For these reasons, this routine was chosen to be coded on the DSP.

The original format of the SVD routine was inadequate for easy compilation to DSP code,
and changes to the code had to be made. Originally all arrays were defined with 1 as their
starting index, and all loop indices ran from 1 to N. To be consistent with the
BLAS/BSAS applications, all array indices and loop indices were adjusted to range from 0 to N-1.
In addition, all dynamically allocated space and 2-dimensional arrays were converted to row-major
1-dimensional arrays for use in DSP code.
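
The index conversion described above can be illustrated with a short sketch. The macro IDX and the function flatten are hypothetical names introduced here, not routines from the report; the sketch assumes the 1-based matrix is held as an array of row pointers.

```c
#include <stdlib.h>

/* Offset of element (i, j) in a matrix stored row by row,
   with both indices running from 0. */
#define IDX(i, j, ncols) ((i) * (ncols) + (j))

/* Copy a 1-based matrix a[1..m][1..n] (array of row pointers) into a
   newly allocated 0-based, row-major, 1-dimensional array.
   The caller frees the result. Returns NULL if allocation fails. */
float *flatten(float **a, int m, int n)
{
    float *flat = malloc((size_t)m * n * sizeof *flat);
    int i, j;
    if (!flat) return NULL;
    for (i = 0; i < m; i++)
        for (j = 0; j < n; j++)
            flat[IDX(i, j, n)] = a[i + 1][j + 1];
    return flat;
}
```

With this layout a whole row, or the whole matrix, can be walked with a single auto-incremented pointer.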

During coding, comparisons were made with the LINPACK version to include as many
BLAS/BSAS routines as possible. All of the BLAS routines used in the LINPACK version are also
used in the coded version, with the exception of SSWAP. In addition to these subroutines, the following
subroutines were also included: SCALCPY, SASUM, FILL, SCOPY.

Two separate versions of SVD were used to compute the regression coefficients of the
Longley data set found in [Wetherill 1985]. The first version was the original [Press 1986] version
with no BLAS or BSAS routines; the other was the optimized version using the library routines. The
data was passed to SVD as a 16x7 matrix, and the results of the decomposition were passed to a
back-substitution routine to determine the regression coefficients. The coefficients calculated by both
the C versions and DSP versions agreed very well with the expected results (see Table 8.4).


^ The routine found in LINPACK contains a number of parameters which allow the user to select the format
of the decomposition. Because of its size and complexity this routine was not chosen to be coded for the DSP.
However, it was used as a reference for implementing the BLAS/BSAS routines.

Regression
coefficient    Exact                 SVD (DSP)      SVD (386 single precision)
x1             15.061872271373       15.0748        15.0706
x2             -0.035819179292       -0.0358134     -0.0358197
x3             -2.020229803816       -2.02015       -2.02022
x4             -1.033226867173       -1.03323       -1.03321
x5             -0.051104105653       -0.0511362     -0.0510876
x6             1829.151464613551     1829.04        1829.12

Table 8.4 Results of Longley benchmark.


To establish a timing comparison, the same data set was bootstrapped 100 times to
determine the variability of the regression coefficients. Additional routines, such as ran and
bootstrap, were used to perform the bootstrapping. The results of this execution timing are
given in Table 8.5.

Because of additional overhead, this table does not give an accurate comparison of the SVD
routines themselves. However, the table does support the use of BLAS and BSAS routines as a
means to further improve the speed of the DSP-coded routines.


[Chart legend: Unmodified C; BLAS/BSAS]


Table  8.5  SVD  timing. 


8.3.3 Autoregressive model (bootstrapped).


program  AR 

Bootstrapping an autoregressive model can lead to an estimate of the predictive error or the error
in the coefficients if iid noise is assumed [Efron 1986]. For this method, the residuals or noise term
from the model (e in Equation 8.2) must be bootstrapped.

y_n = \sum_{j=1}^{N} a_j \, y_{n-j} + e_n          Eq.(8.2)

For this example, a maximum entropy method-based autoregression algorithm from [Press
1986] was converted to low-level subroutines for DSP use. For timing purposes, a 15-pole model was
applied to a 300-point time series. The coefficients were then used to predict 30 future points
and the prediction error. The timing results are shown in Table 8.6.
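
The residual bootstrap for an autoregressive model can be sketched as follows. This is a minimal illustration under stated assumptions: the coefficients a[0..p-1] are taken as already estimated (the maximum entropy fit itself is not shown), the function names are hypothetical, and the standard rand function is used for resampling.

```c
#include <stdlib.h>

/* Residuals e[n-p] = y[n] - sum_{j=1..p} a[j-1]*y[n-j], n = p..npts-1.
   Stores npts - p values in e. */
void ar_residuals(const double *y, int npts, const double *a, int p,
                  double *e)
{
    int n, j;
    for (n = p; n < npts; n++) {
        double s = 0.0;
        for (j = 1; j <= p; j++)
            s += a[j - 1] * y[n - j];     /* multiply-accumulate */
        e[n - p] = y[n] - s;
    }
}

/* Extend the series by nfut points; y must have room for npts + nfut
   values. If e is not NULL, a residual drawn with replacement from
   e[0..ne-1] is added to each predicted point (one bootstrap path). */
void ar_predict(double *y, int npts, const double *a, int p,
                int nfut, const double *e, int ne)
{
    int n, j;
    for (n = npts; n < npts + nfut; n++) {
        double s = 0.0;
        for (j = 1; j <= p; j++)
            s += a[j - 1] * y[n - j];
        if (e != NULL && ne > 0)
            s += e[rand() % ne];
        y[n] = s;
    }
}
```

Repeating ar_predict with resampled residuals gives a set of prediction paths whose root-mean-square deviation can serve as the prediction error.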


                386          386/287      DSP32
Unmodified C    1120 s       252 s        37.25 s
BLAS/BSAS       1200 s       283 s        5.33 s

Table 8.6 AR model bootstrapped timing.

From the speedup, the AR algorithm is ideal for DSP and for low-level subroutine
optimization. This results mainly from a MAC operation with positive and negative indexing that
performs an operation similar to a convolution on the data array^.


Figure 8.3 Time-series data.


Figure  8.4  Autoregressive  model  prediction. 


For  a  smaller  data  set,  taken  from  [Newton  1988],  the  results  are  shown  in  Figures  8.3  and 
8.4.  Here,  the  error  is  given  by  the  root-mean-square  deviations  of  the  bootstrapped  predictions. 

To make this method more widely applicable requires modifying the poles of the AR model
so that they lie within the unit circle. To do this effectively, low-level routines that handle
complex number arithmetic may need to be introduced. Fortunately, there are several examples of
DSP routines that use complex number structures [AT&T 1988].


^ The original unmodified code produced a compile-time error at the MAC stage, the only serious error
observed from the AT&T C DSP compiler. This was eliminated by using the MAC low-level routine.

8.3.4 1D and 2D projection pursuit

program PP1 and PP2

A projection pursuit (PP) algorithm as described by [Friedman 1987] was tested for
DSP applicability. The PP algorithm was designed to detect departures from normality of
a multidimensional data cloud. The results of the algorithm give one- or two-dimensional
projections of the data that exhibit strong tendencies for clustering (see Figure 8.5).
Further applications of the algorithm to the renormalized data give new projections.

The blocks of the algorithm are shown in Figure 8.6 at the end of this chapter. It
features a quasi-Newton optimization technique [Fletcher 1987] for minimizing the projection
index, which gives the departure from normality, along with orthonormality constraints on
the projections. At the lowest level of the routine, there are dot products for calculating
projections, evaluation of the error function, and calculation of the projection index and its
derivative.

Our main emphasis was on applying the more complicated 2D algorithm. This gave a good
test of the DSP’s capabilities as it pushed code size (30K) to nearly the limit of the DSP32 chip (but
not the DSP32C). Fortunately, an optimized error function routine was included in the AT&T C DSP
library (the error function routine from Press [1986] was used for the PC version^). Apart from this
routine, the DSP and PC versions used the same C code. The SDL for this routine is shown in
Section 7.

The results of the timing tests for both routines are shown in Table 8.7. Since the technique
is iterative and only stops when the error term drops below a certain value, the timing per iteration
is shown. The Iris data set (150 cases, 4 dimensions) was used for testing [Becker 1988].

Not surprisingly, the PC and DSP versions of the algorithm took different solution paths
in finding a local minimum of the first projection index. However, the final minima
were nearly equivalent in the two cases (see Figures 8.7 and 8.8 and invert the x-projection), as were
the total numbers of iterations. In both cases, only the projections corresponding to the first
projection pursuit solution are shown.


^ Unfortunately, this error function is not very optimized in that it features many levels of function calls to
lower-level routines.


Figure 8.5 1D projection that maximizes clustering.


                  386           386/287       DSP32
1D (BLAS/BSAS)    70 s/iter     17 s/iter     0.7 s/iter
2D (BLAS/BSAS)    130 s/iter    32 s/iter     1.6 s/iter

Table 8.7 Projection pursuit timing.

Figure 8.7 Projection pursuit result on Iris data. 386 version.

Figure 8.8 Projection pursuit result on Iris data. DSP version.


8.3.5 Markov modeling.
program  MM 


Markov modeling plays an important part in reliability and maintainability predictions, as well
as queuing applications. As such, it is more a probability application than a statistics application. It
has been included here to test the applicability of DSPs to the integration of linear and nonlinear
differential equations (see e.g. Equation 8.3).

The Markov model solution technique chosen for demonstration is matrix free and relies on
an adaptive-step, 4th-order Runge-Kutta integration algorithm. The DSP version of the algorithm was
written in assembler code to maximize the speed (very few BSAS or BLAS routines are required in
the algorithm). For most linear problems, the integration proceeds quickly provided that the transition
rates are not too far apart. For stiff problems, where the rates vary widely, the integration is slower.
One such application, as shown by the state diagram in Figure 8.9, is in maintenance, where the
failure rate (B2=λ) is low but the repair rate (B1=μ) is high. The results of the timing are shown
in Table 8.8 for this application.

\frac{dP_1(t)}{dt} = \ldots, \quad \frac{dP_2(t)}{dt} = \ldots, \quad \text{etc.}          Eq.(8.3)

For  this  particular  problem,  roundoff  errors  are  important  for  long  integration  times.  For 
single  precision,  the  accuracy  of  the  result  degrades  as  the  ratio  between  B2  and  B1  increases  (for 
single  precision  this  must  be  less  than  ~10^). 
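
As an illustration of this kind of integration, the sketch below applies a fixed-step, classical 4th-order Runge-Kutta method to a two-state repairable system. It is a simplification under stated assumptions: the report's routine uses an adaptive step and DSP assembler, the state equations written in the comment are the usual two-state form and are assumed here, and all names are hypothetical. Double precision is used to sidestep the single-precision roundoff issue noted above.

```c
/* Assumed two-state repairable system:
     dP0/dt = -lambda*P0 + mu*P1   (P0: probability the system is up)
     dP1/dt =  lambda*P0 - mu*P1   (P1: probability it is under repair) */
static void deriv(const double p[2], double lambda, double mu,
                  double dp[2])
{
    dp[0] = -lambda * p[0] + mu * p[1];
    dp[1] =  lambda * p[0] - mu * p[1];
}

/* Advance p by nsteps fixed steps of size h using classical RK4. */
void markov_rk4(double p[2], double lambda, double mu,
                double h, long nsteps)
{
    double k1[2], k2[2], k3[2], k4[2], t[2];
    long s;
    int i;
    for (s = 0; s < nsteps; s++) {
        deriv(p, lambda, mu, k1);
        for (i = 0; i < 2; i++) t[i] = p[i] + 0.5 * h * k1[i];
        deriv(t, lambda, mu, k2);
        for (i = 0; i < 2; i++) t[i] = p[i] + 0.5 * h * k2[i];
        deriv(t, lambda, mu, k3);
        for (i = 0; i < 2; i++) t[i] = p[i] + h * k3[i];
        deriv(t, lambda, mu, k4);
        for (i = 0; i < 2; i++)
            p[i] += h * (k1[i] + 2.0 * k2[i] + 2.0 * k3[i] + k4[i]) / 6.0;
    }
}
```

Starting from p = {1, 0}, the availability P0 settles at mu / (lambda + mu). When the two rates differ widely the step size must stay small relative to the faster rate, which is why stiff problems integrate slowly.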


386          386/287      DSP32
1800 s       4%s          16 s

Table 8.8 Markov model timing.


Figure  8.9  State  diagram  for  repairable  system. 


To test the applicability of the DSP solution to nonlinear Markov model problems, a predator-prey
system was also modeled [Gardiner 1983]. This is an example of a Volterra-type model [Sarkar 1987],
which is known to be very sensitive to initial conditions and coefficients. Figure 8.10 was calculated
by the DSP according to the simple relationship in Figure 8.11. The oscillations observed in this case
allow a comparison to the cyclic Lynx (predator) data used in Section 8.3.3 and Figure 8.3 for
autoregressive prediction. These curves demonstrate that the DSP is useful as a general-purpose
scientific computation tool where both statistical forecasting and modeling/simulation techniques are
needed.

Figure 8.10 Nonlinear Markov model predator-prey plot.


Figure  8.11  State  diagram  for  predator-prey  system. 


8.3.6 Iterative techniques (MacKay’s & SOR).
program  M 

Several iterative techniques were tested on the DSP. The first is a matrix pseudo-inversion
algorithm from [MacKay 1981]. A typical application of a pseudo-inversion algorithm is in finding
multiple regression coefficients for an overdetermined data set as shown in Equation 8.4.

y = A x, \quad A \in \mathbb{R}^{N \times M}          Eq.(8.4)

The pseudo-inverse B is calculated by repeated iterations of B against A according to Equation
8.5, where I is the identity matrix of size MxM. The algorithm requires a good initial guess of B to
converge quickly. This is given in more detail in the above cited reference.

B_{k+1} = (2I - B_k A) \, B_k, \quad B \in \mathbb{R}^{M \times N}          Eq.(8.5)


The  timing  results  for  inverting  a  matrix  of  numbers  taken  from  the  Longley  benchmark  are 
shown  in  Table  8.9.  Approximately  50  iterations  were  required  for  the  results  to  converge,  while  100 
were  taken  for  timing. 
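
An iteration of this general type can be sketched in C as shown below. The sketch uses the update B <- (2I - B A) B with A stored as N rows by M columns; the starting guess B0 = A^T / (||A||_1 ||A||_inf) is an assumption made here for illustration and is not taken from [MacKay 1981]. The names matmul and pinv_iter are hypothetical, and A is assumed to be nonzero with full column rank.

```c
#include <math.h>
#include <stdlib.h>

/* c (r x q) = x (r x k) * y (k x q); all matrices row-major. */
static void matmul(const double *x, const double *y, double *c,
                   int r, int k, int q)
{
    int i, j, l;
    for (i = 0; i < r; i++)
        for (j = 0; j < q; j++) {
            double s = 0.0;
            for (l = 0; l < k; l++)
                s += x[i * k + l] * y[l * q + j];
            c[i * q + j] = s;
        }
}

/* Iterate B <- (2I - B A) B for A (n x m); B is m x n.
   Returns 0 on success, -1 on allocation failure. */
int pinv_iter(const double *a, double *b, int n, int m, int iters)
{
    double *t = malloc((size_t)m * m * sizeof *t);
    double *bn = malloc((size_t)m * n * sizeof *bn);
    double norm1 = 0.0, norminf = 0.0;
    int i, j, it;
    if (!t || !bn) { free(t); free(bn); return -1; }

    for (j = 0; j < m; j++) {              /* largest column sum */
        double s = 0.0;
        for (i = 0; i < n; i++) s += fabs(a[i * m + j]);
        if (s > norm1) norm1 = s;
    }
    for (i = 0; i < n; i++) {              /* largest row sum */
        double s = 0.0;
        for (j = 0; j < m; j++) s += fabs(a[i * m + j]);
        if (s > norminf) norminf = s;
    }
    for (i = 0; i < n; i++)                /* assumed guess: scaled A^T */
        for (j = 0; j < m; j++)
            b[j * n + i] = a[i * m + j] / (norm1 * norminf);

    for (it = 0; it < iters; it++) {
        matmul(b, a, t, m, n, m);                    /* t = B A        */
        for (i = 0; i < m * m; i++) t[i] = -t[i];
        for (i = 0; i < m; i++) t[i * m + i] += 2.0; /* t = 2I - B A   */
        matmul(t, b, bn, m, m, n);                   /* bn = t B       */
        for (i = 0; i < m * n; i++) b[i] = bn[i];
    }
    free(t);
    free(bn);
    return 0;
}
```

Each pass consists of two matrix multiplies, so nearly all of the work is multiply-accumulate operations.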


                386          386/287      DSP32
Unmodified C    41.19 s      9.78 s       -
BLAS/BSAS       40.9 s       10.17 s      0.146 s
Assembler       -            -            0.13 s

Table 8.9 Iterative matrix inversion timing.


Compared with conventional techniques for inversion (such as SVD),
the iterative technique is slower. Furthermore, unless pre-computation data centering is
applied, the DSP solution accuracy suffers.

The  strong  advantage  that  the  DSP  technique  offers  is  the  speed  of  computation.  This  is  not 
surprising  since  the  algorithm  is  rich  in  matrix  multiplies  and  other  BSAS  routines.  This  allows  the 
DSP  to  run  at  peak  efficiency  throughout  the  routine. 

program  SOR 

Gauss-Seidel iteration is a useful technique for solving nonlinear sets of equations, which may
occur in projection pursuit regression or other methods. An improvement on the technique is given
by the method of simultaneous over-relaxation (SOR).

Listing 8.7 shows a C code fragment taken from the inner loop of an SOR algorithm
from Press [1986]. The algorithm operates on a two-dimensional array u and equation coefficients
a, b, c, d, e, f. As written, the code is not optimized for DSP use since too many array indexing
references are required. To optimize this code, Listing 8.7 is modified to the code fragment in Listing
8.9, which uses pointer referencing. The compiled DSP assembly language multiply-accumulate
portion of this code is shown in Listing 8.8. Note that converting the equation coefficients (a -
f) to an array A further condenses the code and thus makes it highly optimal without resorting to
tedious hand assembly coding.

    if ((j+l) % 2 == n % 2) {
        resid  = a[j][l]*u[j+1][l];
        resid += b[j][l]*u[j-1][l];
        resid += c[j][l]*u[j][l+1];
        resid += d[j][l]*u[j][l-1];
        resid += e[j][l]*u[j][l] - f[j][l];
        anorm += fabs(resid);
        u[j][l] -= omega*resid/e[j][l];
    }


Listing 8.7 Original SOR C code


a2  -  •  ^94+ 

a2  >  a2  >  •  *rOM 

a2  -  a2  >  •rlO*+  •  •r?** 
•2  >  a2  +  «r10^  • 
a2  >  a2  >  •  •ri 

a2  >  a2  -  •rlO— 


Listing  8.8  DSP  MAC  portion  of  SOR 


Listing  8.9  Pointer  converted  SOR  C  code 
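
The pointer conversion described in the text can be illustrated with the sketch below. It is not the report's Listing 8.9: the function name, the row-major layout of u, and the packing of the six coefficients (a to f) contiguously for each grid point are all assumptions made for this example.

```c
#include <math.h>

/* One half-sweep over the interior points of an n x n grid u
   (row-major) whose (j + l) parity equals `parity`. coef holds six
   coefficients (a, b, c, d, e, f) per grid point, stored contiguously.
   Returns the accumulated absolute residual. */
double sor_half_sweep(double *u, const double *coef, int n,
                      double omega, int parity)
{
    double anorm = 0.0;
    int j, l;
    for (j = 1; j < n - 1; j++) {
        int first = 1 + ((j + 1 + parity) & 1);   /* first l of this colour */
        double *up = u + j * n + first;
        const double *cp = coef + 6 * (j * n + first);
        for (l = first; l < n - 1; l += 2, up += 2, cp += 12) {
            double resid = cp[0] * up[n]     /* a * u[j+1][l] */
                         + cp[1] * up[-n]    /* b * u[j-1][l] */
                         + cp[2] * up[1]     /* c * u[j][l+1] */
                         + cp[3] * up[-1]    /* d * u[j][l-1] */
                         + cp[4] * up[0]     /* e * u[j][l]   */
                         - cp[5];            /* f             */
            anorm += fabs(resid);
            *up -= omega * resid / cp[4];
        }
    }
    return anorm;
}
```

Because the coefficients sit next to one another in memory, the residual is a short run of multiply-accumulates addressed through two pointers, which is the pattern a DSP compiler can map onto auto-incremented registers.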


Without going into detail about the statistical application of the technique, we can also demonstrate
the performance figures for this code. Table 8.10 compares the PC and DSP performance for an
11x11 array and 1000 iterations.


386          386/287      DSP32
76.7 s       18.6 s       1.21 s

Table 8.10 SOR timing.


8.3.7 Density estimation.

program  DE 

Density  estimation  as  discussed  in  [Silverman  1986]  is  a  recently  introduced  method  that 
benefits  greatly  from  improvements  in  computer  performance  and  algorithm  enhancements.  This 
technique  uses  an  FFT  to  simplify  the  convolution  of  the  density  histogram  with  the  windowing  kernel 
(see  Equation  8.6).  It  has  applications  in  smoothing  a  bootstrap  and  in  empirical  Bayesian 
calculations. 

P(x) = \int f(x - t) \, w(t) \, dt          Eq.(8.6)


For this example, the algorithm in [Griffiths 1985] was converted from FORTRAN into C and
uses the corresponding BLAS/BSAS routines. In addition, the FFT routine was provided by the DSP
applications library. The PC-version FFT was adapted from [Press 1986]. The windowing kernel was
assumed to be normal with a user-adjustable width.

The  results  of  the  timing  analysis  are  given  in  Table  8.11  for  a  256  point  FFT  and  250  data 
points.  A  total  of  50  iterations  were  measured  to  improve  accuracy  of  timing. 


386          386/287      DSP32
164 s        40 s         1.6 s

Table 8.11 Density estimation timing.

The  speed  advantage  provided  by  the  DSP  resides  in  the  FFT  routine.  In  particular,  by 
removing  the  normal  kernel,  a  speedup  of  50%  is  seen.  Other  kernels,  such  as  the  Epanechnikov 
kernel  [Silverman  1986],  will  improve  performance  further. 

Density estimation also has some applications in areas where fast data collection is needed.
Figures 8.12 and 8.13 show samples of density estimates of noise introduced through the DSP
board’s analog audio input. The data is updated several times a second, with the PC handling all of
the graphics.

Figure 8.12 Density estimation of audio-frequency noise. Large window.


Figure  8.13  Density  estimation  on  audio  frequency  noise.  Small  window. 


8.3.8 Survival analysis (Kaplan-Meier estimate).

program  SUR 

Nonparametric  survival  estimators  such  as  the  Kaplan-Meier  algorithm  (see  Listing  8.10)  have 
often  been  used  in  computation-intensive  applications  such  as  bootstrapping  [Efron  1986,  Grier  1988, 
Akritas  1986].  In  the  latter  reference,  much  work  was  done  on  vectorizing  the  low-level  code  so  the 
algorithm  can  be  run  efficiently  on  a  supercomputer,  thereby  saving  valuable  computer  time^. 

The optimization of the algorithm for DSP use has little resemblance to that described in
[Grier 1988]. For one, vectorizing is not warranted in the DSP. It not only is difficult to do with the
present DSP capabilities, but it also adds considerable memory overhead. In the supercomputer
example, B copies of the data set, corresponding to B bootstraps, were stored in memory before the
estimate was performed. For large data sets, and B > 100, this can become a large memory
requirement. For supercomputers with gigabytes of storage and no cost to the user apart from CPU
time, this is a cost-effective way to proceed.


      KM(1) = (ALIVE(1) - DIED(1)) / ALIVE(1)
      DO 10 I = 2, NUMOBS
   10 KM(I) = KM(I-1) * (ALIVE(I) - DIED(I)) / ALIVE(I)


Listing 8.10 Kaplan-Meier algorithm.


^ Even then, for the survival data they analyzed, the computation time amounted to 6 hours. We used a
different data set for timing.

For the DSP, however, we approach the problem differently. The technique that we use to
optimize the Kaplan-Meier estimate for DSP use is to convert the divisions (Listing 8.11) to
multiplications and then use the MAC type instructions to do the multiplications with
automatic indexing (Listing 8.12). Since a division is more time consuming than a
multiplication (for both a DSP and conventional microprocessor), all the divisors are stored in an
array that is precomputed and then used over the many bootstraps.


void kmdiv( npts, censor, result )
int npts;
float censor[], result[];
{
    register float *oldresult, one=1.0, remain;

    remain = (float) npts;            /* number of cases still at risk */
    oldresult = result;
    npts -= 3;

    *result++ = (one - *censor++ / remain--);
    do
        *result++ = *oldresult++ * (one - *censor++ / remain--);
    while (npts-- >= 0);
}


Listing 8.11 Kaplan-Meier coded with division.


void km( npts, censor, result, divisor )
int npts;
float censor[], result[], divisor[];
{
    register float *oldresult_p, *censor_p,
                   *divisor_p, *result_p;
    register float one=1.0, temp;
    register int count;

    oldresult_p = result;
    censor_p = censor;
    divisor_p = divisor;              /* points at 1/npts; walked downward */
    result_p = result;
    count = npts - 3;

    *result_p++ = one - *censor_p++ * *divisor_p--;
    do
    {
        temp = one - *censor_p++ * *divisor_p--;
        *result_p++ = *oldresult_p++ * temp;
    }
    while (count-- >= 0);
}


/* Precompute the reciprocals 1/1, 1/2, ..., 1/npts. */
void generatediv( npts, divisor )
int npts;
float divisor[];
{
    register int i;

    for (i=1; i<=npts; i++)
        *divisor++ = 1.0 / (float) i;
}


void km_est( npts, data, censor, result )
int npts;
float data[], censor[], result[];
{
    float divisor[SIZE];

    generatediv(npts, divisor);
    km(npts, censor, result, &divisor[npts-1]);
}


Listing 8.12 Kaplan-Meier coded with multiplication.


After compiling this to pseudoassembler language we can further optimize by eliminating
nops. Table 8.12 gives the timing for 100 loops of the code in Listing 8.12 (computational results
were the same for all three processors). The input data set contained 62 cases (taken from [BMDP
1985] p.562). In a more realistic example, which may involve bootstrapping and factorial design as
Grier demonstrated, we do not expect as great a performance improvement.


386          386/287      DSP32
2.6 s        0.6 s        0.01 s

Table 8.12 Kaplan-Meier timing.


8.3.9  K-means  clustering  (bootstrapped). 

program  KM 

The  K-means  clustering  algorithm  is  a  simple  technique  used  for  separating  a  set  of  N  cases 
in  P-dimensions  into  K  clusters.  It  involves  the  steps  of  initially  separating  the  cases  into  a  set  of 
seed  clusters,  computing  the  cluster  means,  and  then  rearranging  the  cases  to  the  closest  cluster  mean 
[Hartigan  1985,  BMDP  1985].  This  repeats  until  no  further  changes  are  made  and  the  within-cluster 
sum-of-squares  is  minimized,  and  another  value  of  K  can  be  chosen. 

As the number of dimensions increases, the algorithm loses effectiveness due to a large search
space and the possibility of encountering a local minimum. Bootstrapping applied to the initial data
set allows estimates of the variability of the K-means method to be made. This, however, will
increase computation time greatly (9 hours for 250 6-D patterns on a superminicomputer) [Jain 1988].

We have adapted the FORTRAN code in [Hartigan 1985] to C with the BLAS/BSAS
extensions to test the performance of the algorithm in a DSP environment. The lowest level of the
algorithm is dominated by calls to the DIST function. This returns the squared Euclidean distance
between any two points (see Appendix A). Since for two dimensions the function call overhead is
a large percentage of the computation time, we expect that the computation speed will improve for
higher dimensions. This is shown in Figure 8.14 for NxP=constant.
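
A squared-distance routine of this kind, together with its use in the reassignment step, might look like the sketch below. This is not the Appendix A code: the signature of dist and the helper nearest_cluster are assumptions introduced for illustration.

```c
/* Squared Euclidean distance between x[0..p-1] and y[0..p-1]. */
float dist(const float *x, const float *y, int p)
{
    float sum = 0.0f;
    while (p-- > 0) {
        float d = *x++ - *y++;
        sum += d * d;                /* multiply-accumulate */
    }
    return sum;
}

/* Index of the cluster mean nearest to point; means holds k rows of
   p values each, stored row by row. */
int nearest_cluster(const float *point, const float *means, int k, int p)
{
    int best = 0, c;
    float dbest = dist(point, means, p);
    for (c = 1; c < k; c++) {
        float d = dist(point, means + c * p, p);
        if (d < dbest) {
            dbest = d;
            best = c;
        }
    }
    return best;
}
```

The call overhead is paid once per pair of points, while the loop length grows with P, which is consistent with the better relative performance expected at higher dimensions.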


                386          386/287      DSP32
Unmodified C    64.6 s       16.3 s       3.68 s
BLAS/BSAS       72.6 s       18.4 s       0.77 s

Table 8.13 K-means timing.


The results of a timing comparison between the DSP and PC performance are shown in Table
8.13 for P=16 and N=28 (computational results were the same for all three processors).

This set of cases is not a good application of K-means (the number of points is too small
compared to the dimension); however, for larger sets of data, the speed advantage becomes
considerable.

[Plot: K-means clustering speed comparison, DSP vs 386. Curves: DSP opt, 386/287 opt, 386 opt,
DSP no opt, 386/287 no opt, 386 no opt.]

Figure 8.14 Performance versus dimension of K-means solution space.

8.3.10 Kendall’s tau

program  KT 

Kendall’s Tau is a nonparametric correlation technique which is useful when the probability
distribution function from which the data is drawn is not necessarily known. Nonparametric correlation
replaces the data values by their rank with respect to all the other data. The major advantage of such
techniques is that when a correlation is shown to be present nonparametrically, it really exists [Press 1986].
The disadvantage, however, is that since the technique discards information by producing a rank order,
it may sometimes fail to find an existing correlation.

Unlike other nonparametric correlations, Kendall’s Tau does not require that the data be
sorted and ranked. Instead it uses the relative ordering of ranks. This is done by comparing all pairs
of data points, checking the relative ordering of ranks, and incrementing and decrementing counters
on rank tests.


The source for the Kendall’s Tau algorithm was taken from [Press 1986] and modified to allow
the DSP compiler to create more efficient code. The major modifications were the use of pointers
instead of arrays with indices. Additional modifications replaced the "for" statement with the "do
while" construct to eliminate loop overhead.

Although  these  modifications  helped  the  DSP  compiler  create  efficient  code,  an  extra  step  was 
taken  to  manually  optimize  the  DSP  assembly  code.  This  step  involved  eliminating  unnecessary  nops 
and  interleaving  unrelated  instructions.  Through  this  process  the  original  code  performance  was 
improved  by  26%. 

Table 8.14 shows the execution times for 100 iterations of Kendall’s Tau on the
different processors. The speedup factor for this routine is not as large as for the others, mainly
because the hand-written BLAS/BSAS routines are not used in the algorithm. In addition,
the full potential of the DSP is not being used because the routine contains mostly integer
operations. Thus we expect this comparison to show mainly the speedup due to the optimized
instruction set and pipeline effects.

8.3.11 Bayesian bootstrap (integration by Simpson’s rule).

program  BB 

The bootstrap method can also be applied in Bayesian analysis. One of the most frequent
criticisms of Bayesian analysis is the use of subjective prior information, while the choice of the error
distribution is seldom challenged. Boos and Monahan (1986) proposed a bootstrap method
incorporating prior information which performs well without direct knowledge of the error
distribution.

The first step is to estimate the distribution function of the data using the empirical
distribution function F_n of the observations. Next, generate B random samples of size n from F_n and
calculate the statistic of interest from each sample j. Then, from the B simulated estimates of the statistic
of interest, compute the kernel density estimator. Finally, calculate the posterior distribution for the
statistic of interest by using Simpson’s rule as a numerical integration method.

For faster computations, we employ the Epanechnikov kernel [Silverman 1986] instead of
the normal kernel for the density estimation. This increases the DSP performance advantage
over the PC implementation by a factor of 5. If the normal kernel is used, less advantage is
realized because of the frequent subroutine calls^ and slow function evaluation, as the exponential
calculation is done in software.
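
The two numerical ingredients named above, Simpson's rule and the Epanechnikov kernel, can be sketched as follows. This is an illustration rather than the code of Program BB: the kernel is written in its unit-support form 0.75(1 - u^2), which is an assumption about scaling, and the function names are hypothetical.

```c
#include <math.h>

/* Epanechnikov kernel, unit-support form: 0.75 (1 - u^2) for |u| <= 1.
   It needs only a multiply and a subtract, with no exponential. */
double epanechnikov(double u)
{
    return (fabs(u) <= 1.0) ? 0.75 * (1.0 - u * u) : 0.0;
}

/* Composite Simpson's rule for the integral of f over [a, b] using
   n subintervals; n must be even and at least 2. The integrand is
   passed by pointer. */
double simpson(double (*f)(double), double a, double b, int n)
{
    double h = (b - a) / n;
    double sum = f(a) + f(b);
    int i;
    for (i = 1; i < n; i++)
        sum += ((i & 1) ? 4.0 : 2.0) * f(a + i * h);
    return sum * h / 3.0;
}
```

As a check, simpson(epanechnikov, -1.0, 1.0, 10) integrates the kernel over its support, and a kernel should integrate to one.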


386          386/287      DSP
87 s         21 s         2 s

Table 8.14 Kendall’s tau timing.


^ To show the flexibility of DSP programming, the Epanechnikov subroutine call was passed by pointer.

Table 8.15 shows the execution times for Program BB on the different processors
(computational results were the same for all three processors). The evaluation includes 20 bootstrap
replications of the median of a one-dimensional 50-point data set (Example 1 [Boos 1986]). The
DSP has some advantages over the conventional processor. However, we believe that the DSP will
have greater applicability for multidimensional data since more vector operations are needed.


386          386/287      DSP32
64.3 s       14.5 s       1.53 s

Table 8.15 Bayesian bootstrap timing.


8.3.12  Neural  networks  for  discrimination. 
program  NN 

Neural networks (NN) have received considerable attention lately. Because of their
self-learning capabilities, they have applications to statistics, particularly in situations where trends are not
discernible by other techniques. In this way, they share some similarities with projection pursuit [Interface
1986]. Neural networks are also computation-intensive, as most of the training and learning is the
result of summing and multiplying operations.

The  statistical  neural  net  chosen  for  DSP  demonstration  is  adapted  from  the  probabilistic 
neural  net  (PNN)  described  in  [Specht  1990].  The  claim  of  the  PNN  algorithm  is  that  it  is  much 
faster  than  other  NN  techniques.  However,  on  closer  examination,  it  is  very  much  similar  to  the 
density  estimator  described  earlier  with  a  Bayes  decision  rule  applied  for  discrimination.  Silverman 
[1986]  describes  this  approach  further.  The  difference  in  the  PNN  approach  is  that  the  densities  are 
not  calculated  at  once,  but  are  calculated  (in  the  neural  network  approach)  by  associating  each  point 
with  every  other  point.  The  density  estimator  for  the  PNN  assuming  a  normal  kernel  is  given  in 
Equation  8.7. 


f(x_i) = \frac{1}{\sqrt{2\pi}\,\sigma} \sum_{j=1}^{K} \exp\left[ -\left( \frac{x_i - x_j}{\sqrt{2}\,\sigma} \right)^{2} \right]          Eq.(8.7)


Our version of the technique uses the DIST function as the only low-level BSAS routine.
The results of the timing tests for the PNN using a normal kernel are shown in Table 8.16
(discrimination results were the same for all three processors). In this example, as in the previous one,
the choice of kernel has a large impact on speed.

386          386/287      DSP32
63 s         16.3 s       0.66 s

Table 8.16 PNN timing.

8.3.13 Euclidean distance measurement

program CL

This measurement was taken from a statistic used to analyze two-dimensional point
distributions of defects occurring during the semiconductor wafer manufacturing process [Pukite, in
press]. The computation-intensive part of the algorithm is very similar to that of the K-means and
PNN algorithms in that a Euclidean distance is calculated. This is repeated for each pair of defects
observed, giving N(N-1)/2 total calculations. The DSP version of the C code had few low-level
subroutines and so was further optimized by hand. Table 8.17 gives the performance results for
N=430. The speedup over the conventional processor was limited by the lack of true array
operations and the square root calculation.


386          386/287      DSP32
137 s        31.1 s       2.75 s

Table 8.17 Euclidean distance measurement.
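
The pairwise computation described for program CL can be sketched as follows. This Python fragment is a hedged illustration rather than the benchmarked C code. The function names are hypothetical, and the points are assumed to be (x, y) tuples. For N=430 the pair count N(N-1)/2 is 92,235 distances, each requiring a square root.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Return the Euclidean distance for every unordered pair of 2-D points."""
    return [
        math.hypot(x1 - x2, y1 - y2)
        for (x1, y1), (x2, y2) in combinations(points, 2)
    ]


def pair_count(n):
    """Number of unordered pairs among n points, N(N-1)/2."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    triangle = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
    print(pairwise_distances(triangle))
    print(pair_count(430))
```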


8.3.14 Stochastic simulation.
program  ST 

This was included to test a reverse Polish parsing routine that may be applicable for
user-defined Monte Carlo simulations. Designing this routine within the DSP presented no real
problems. Unfortunately, since the parser is rich in integer and string type computations, it is not
as suitable for DSP use. In addition, the coprocessor version did not show as large a speedup as the
other algorithms. As an alternative approach, efficiency can be improved by compiling these
operations before runtime with a stripped down compiler [Korn 1989].


386          386/287      DSP32
45 s         15 s         1.4 s

Table 8.18 Parsing of formulas timing.
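
To make the idea of a reverse Polish routine for user-defined Monte Carlo simulations concrete, a minimal sketch is given below. It is not program ST. The token format, the variable names u and v, and the function names are assumptions made for illustration. Note how much of the work is string comparison and stack handling rather than floating point arithmetic, which is consistent with the smaller DSP advantage reported above.

```python
import operator
import random

OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def eval_rpn(tokens, variables):
    """Evaluate a formula given as reverse Polish tokens, using a stack."""
    stack = []
    for tok in tokens:
        if tok in OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[tok](left, right))
        elif tok in variables:
            stack.append(variables[tok])
        else:
            stack.append(float(tok))  # numeric literal
    if len(stack) != 1:
        raise ValueError("malformed expression")
    return stack[0]


def simulate(tokens, n_trials, rng):
    """Monte Carlo mean of the formula with u, v drawn uniformly from [0, 1)."""
    total = 0.0
    for _ in range(n_trials):
        total += eval_rpn(tokens, {"u": rng.random(), "v": rng.random()})
    return total / n_trials


if __name__ == "__main__":
    print(eval_rpn("3 4 + 2 *".split(), {}))
    print(simulate("u v +".split(), 2000, random.Random(1)))
```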


8.3.15  Hypothesis  testing 

There are several other computation-intensive statistical applications that we have
implemented on a DSP. In particular, [Noreen 1989] gives several examples of programs featuring
shuffling, Monte Carlo simulation, and bootstrapping for testing statistical hypotheses. Many of the
programs feature statistics that are specifically designed for the data at hand. In this case, new
low-level subroutines need to be implemented that may not be among those listed in Appendix A. The
performance of one such example, given by Program 2.4 in [Noreen 1989], is given in Table 8.19 along
with Noreen’s performance evaluation on a Macintosh II system under different program
compilers.


386       386/287    DSP32     Mac II, Basic    Mac II, Fortran    Mac II, Pascal
                               [Noreen 1989]    [Noreen 1989]      [Noreen 1989]
549 s     125 s      6.66 s    214 s            105 s              559 s

Table 8.19 Shuffle statistic timing.
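
A shuffle (approximate randomization) test of the general kind discussed in this section can be sketched as follows. This is not a transcription of Program 2.4 in [Noreen 1989]. The test statistic (absolute difference in group means), the function names, and the form of the significance estimate, (count + 1) / (shuffles + 1), are assumptions chosen for illustration.

```python
import random


def mean(values):
    return sum(values) / len(values)


def shuffle_test(group_a, group_b, n_shuffles, rng):
    """Approximate randomization test for a difference in group means.

    Returns the estimated significance level (count + 1) / (n_shuffles + 1),
    where count is the number of shuffles whose statistic is at least as
    large as the observed one.
    """
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    count = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # random relabelling of the observations
        stat = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if stat >= observed:
            count += 1
    return (count + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    a = [1, 2, 3, 4, 5, 6, 7, 8]
    b = [101, 102, 103, 104, 105, 106, 107, 108]
    print(shuffle_test(a, b, 999, random.Random(0)))
```

The inner loop is dominated by summation over the shuffled array, which is the kind of array operation that benefits from a dedicated low-level routine.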


9  SW  PERFORMANCE/COST  EVALUATION 


This  section  presents  a  detailed  statistics  workstation  performance  and  cost  evaluation.  This 
evaluation  is  based  on  statistical  program  benchmarking  data  presented  in  Section  8.  The  key  tradeoff 
factors  include  hardware  availability,  total  workstation  cost,  and  expected  workstation  performance. 

Before proceeding with the details of performance and cost evaluation, we will present our
rationale for selecting the DSP as a processor. Since DSP’s are not used for conventional statistical
computations, their use in statistical applications is often questioned. The key questions are usually
phrased as:

o  Why  not  use  a  conventional  high-speed  processor  instead  of  a  DSP? 

o  Why  not  use  a  numeric  coprocessor  instead  of  a  DSP? 

o  Is  the  extra  effort  needed  for  developing  DSP  software  worth  the  increase  in  development 
cost? 

o  What  is  the  overall  cost-effectiveness  of  the  proposed  statistics  workstation  and  how  does  it 
compare  to  a  distributed  network  of  computers  or  other  parallel  processors? 

o  How  will  the  future  advancements  in  microprocessor  design  affect  the  use  of  DSP’s  for 
statistical  computations? 

Answers  to  these  questions  are  presented  in  this  section,  starting  with  an  overview  and 
followed  by  a  detailed  discussion. 

9.1  Conventional  Processors  versus  DSP 

The  majority  of  the  conventional  microprocessors  have  been  designed  primarily  for  integer 
and  string  computations.  Only  the  most  recent  microprocessors,  such  as  Intel  486  and  Motorola 
68040,  incorporate  floating  point  capabilities. 

The advantages of using conventional microprocessors include their wide availability, low cost,
and excellent software support. However, conventional microprocessors are slow not only in
performing floating point computations, but also in supporting advanced array indexing operations.
The slow floating point processing speed is due to the need for software emulation of floating point
operations (if a coprocessor is not available).

DSP’s,  on  the  other  hand,  have  been  developed  for  supporting  fast  floating  point  operations 
and  concurrent  array  indexing.  Their  floating  point  processing  speed  is  at  least  one  order  of 
magnitude  higher  than  that  of  the  486  microprocessor  (which  operates  at  ~1  MFLOPS). 

9.2  Math  (Numeric)  Coprocessors  versus  DSP 

Widely available math coprocessors include the 80x87, 68881, etc. They are offered as options
to the basic system at an extra cost of several hundred dollars, depending on the system clock speed.
The earlier coprocessors (8087 and 80287) offer a 3 to 5 times speed improvement over the
stand-alone 8088 and 80286 CPU. The more recent coprocessors offer a 10 times advantage for the 80387
and up to 20 for the 80486 (with internal 487 floating point capability). This improvement, however,
can only be achieved when a substantial number of floating point computations are involved.


Although the math coprocessors perform floating point operations in hardware, they need many
clock cycles to perform the basic floating point operations. In addition, a math coprocessor must
communicate with the CPU to gain bus access before it can begin to perform an operation (see
Figure 9.1). Thus, the host processor must grant the bus to the coprocessor and then initiate the
floating point operations. DSP’s, on the other hand, can perform floating point multiplication and
addition in a single instruction cycle. This capability leads to the speed improvement quoted above.

In  addition,  one  is  restricted  to  a  single  coprocessor  add-on  per  conventional  microprocessor. 
To  further  increase  speed,  the  only  route  to  take  is  to  enhance  the  performance  or  clock  rate  of  the 
conventional  microprocessor/coprocessor  combination.  This  is  in  contrast  to  a  multiple  DSP 
approach. 

One key advantage of a numeric coprocessor is the ease of integration, as most of the major
language compilers support standard numeric coprocessors. Many high-level language compilers also
have a feature to detect the absence of a coprocessor and invoke emulation during runtime. Another
advantage of numeric coprocessors over the present-generation DSP’s is their double precision
floating point computation capability.

Thus,  if  extensive  floating  point  operations  are  needed  or  a  multi-processor  environment  is 
envisioned,  then  the  DSP  provides  a  more  cost-effective  solution. 

9.3  Digital  Signal  Processor  Tradeoffs 

A  detailed  discussion  of  the  available  DSP  devices  is  provided  in  Appendix  D.  Although 
these  devices  differ  in  their  physical  implementation,  they  are  very  similar  with  respect  to  the 
available  floating  point  operations. 

The DSP architecture borrows heavily from supercomputer architecture. The borrowed
features include pipelining and multiple address and data buses, among others. DSP’s can also be
considered reduced instruction set computer (RISC) processors specialized for highly repetitive
operations [HPS 1990, p. 26]. Because of their unique architectures, DSP’s have certain advantages
and disadvantages when used in computational applications. These must be clearly understood if an
optimum application of these devices is desired. Even though the disadvantages of DSP operation may
outnumber the advantages, the performance issue is still key. This is similar to a supercomputer
calculation, which has speed and memory advantages, but little else. If the user needs these
capabilities, the high-performance system is still the best choice.

Figure 9.1 Math coprocessor data path.

The statistics workstation in the end will not work as a single processor. It will combine the
strengths of the conventional microprocessor with those of DSP’s to create a very powerful system.
Many of the DSP advantages and disadvantages were mentioned earlier. A summary of these is
presented below.

9.3.1 Advantages of using DSPs


The one overriding advantage that the DSP has over conventional microprocessors is its speed
in floating point computations. The low price of the DSP gives it a further cost/performance
advantage.

DSP’s are ideally suited for floating point number addition, subtraction, and multiplication.
Further advantage is gained with these operations if automatic index incrementing is feasible.

Multiply accumulate (MAC) is the most powerful DSP operation. In the DSP32, one MAC operation
requires 4 clock cycles. If the same operation were implemented on a conventional microprocessor,
such as the 80x86, it would require several instructions, each requiring many cycles. For the DSP32C,
the speed advantage over the Intel 80486/487 is ~25 for this instruction. The more advanced
DSP32C is also much faster for a variety of BSAS level routines than the DSP32 used in this study
(see Figure 9.2).
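
The role of the MAC operation in statistical work can be illustrated with an inner product, which is a chain of multiply-accumulate steps. The Python sketch below is illustrative only. The function names are hypothetical, and the cycle estimate simply multiplies the number of terms by the 4 clock cycles per MAC quoted above for the DSP32, ignoring loop overhead, pipeline effects, and memory wait states.

```python
def mac_dot(x, y):
    """Inner product written as a chain of multiply-accumulate (MAC) steps."""
    acc = 0.0
    for xi, yi in zip(x, y):
        acc += xi * yi  # one MAC: multiply, then add into the accumulator
    return acc


def sum_of_squares(x):
    """Sum of squares, a common statistical kernel, as a MAC chain."""
    return mac_dot(x, x)


def mac_clock_cycles(n_terms, cycles_per_mac=4):
    """Rough clock-cycle count for n_terms MAC steps (overheads ignored)."""
    return n_terms * cycles_per_mac


if __name__ == "__main__":
    print(mac_dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))
    print(mac_clock_cycles(50))
```

Sums of squares and cross products reduce to the same loop, so a routine built around a single MAC instruction covers a large share of the low-level work.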

DSP programming is no more difficult than programming conventional microprocessors, given
the support of a high-level language such as C. DSP’s are also satisfactory for logical
operations, with the majority of these (except for bit operations) supported in current DSP’s.

The  higher  level  of  the  DSP  assembly  instructions  makes  it  easier  to  read  the  code  and  debug 
the  program.  This  is  particularly  true  for  mathematical  applications.  Programmed  correctly,  many 
high-level  C  expressions  (such  as  multiply-accumulate)  will  compile  to  a  single  DSP  instruction.  This 
is  an  improvement  over  the  conventional  microprocessor. 

Finally, whereas the numeric coprocessor requires continuous intervention by the host
processor, the DSP can operate in an autonomous mode after the program has been downloaded.
This means that the DSP can be assigned a particular computation task with full local authority. This
capability makes it possible to do parallel operations using multiple DSP’s, thus improving
performance further. The simplest example of concurrency is simultaneous operation of the PC and
DSP, each running separate tasks. In the master and slave mode of operation, statistical computation
tasks are divided between host and DSP. An optimum mix is needed to achieve the best
performance.

Figure 9.2 DSP32C speedup over the DSP32 used in this study for a variety of subroutines.

An alternate implementation could use the serial data link to communicate between the
individual DSP’s. In this case, neither the host processor nor the data bus is involved in data transfer
operations. Thus, such a system could greatly reduce the data transfer load handled by the host
processor. Through parallel operation, many DSP’s can perform specific operations and use the high
speed serial data link for intercommunications.

9.3.2 Disadvantages of using DSPs

Present-generation floating point DSP’s are 32 bit machines and as such they use
single precision. To improve computation accuracy we must use techniques such as data centering,
double pass, grouping of variables, and sorting the values preceding the summation [Thisted
1988]. Although these operations differ from the conventional set of digital processing operations,
they can be efficiently coded to operate in a DSP. Note that many of these techniques were
developed for use on the early minicomputers, which only supported single-precision floating point.
Although single precision accuracy presents a limitation, many statistical problems do not
require higher accuracy because of inaccuracies in the initial data values.
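
The effect of centering and a double pass in single precision can be demonstrated with a small experiment. The Python sketch below is an illustration under stated assumptions, not code from this study: it emulates IEEE single precision by rounding every intermediate result with the struct module, whereas the DSP32 uses its own internal floating point format. The function names and the test data are hypothetical.

```python
import struct


def f32(value):
    """Round a Python float to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def variance_one_pass(data):
    """Textbook formula (sum x^2 - (sum x)^2 / n) / (n - 1) in single precision."""
    n = len(data)
    s = 0.0
    s2 = 0.0
    for x in data:
        x = f32(x)
        s = f32(s + x)
        s2 = f32(s2 + f32(x * x))
    return f32(f32(s2 - f32(f32(s * s) / n)) / (n - 1))


def variance_two_pass(data):
    """Double pass: compute the mean first, then sum centered squares."""
    n = len(data)
    s = 0.0
    for x in data:
        s = f32(s + f32(x))
    mean = f32(s / n)
    ss = 0.0
    for x in data:
        d = f32(f32(x) - mean)  # centering keeps the squared terms small
        ss = f32(ss + f32(d * d))
    return f32(ss / (n - 1))


# Hypothetical data: small spread around a large offset.
data = [10000.0 + k for k in range(10)]

if __name__ == "__main__":
    print("one pass:", variance_one_pass(data))
    print("two pass:", variance_two_pass(data))
```

For this data set the exact sample variance is about 9.17. The centered double-pass result agrees with it to single-precision accuracy, while the one-pass formula is off by more than 1 because the two large sums it subtracts can only be resolved to multiples of 64 in single precision.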

Moreover,  the  single-precision  floating  point  limitation  is  only  a  short  term  problem,  as  the 
next-generation  DSP’s  are  already  extending  floating  point  accuracy.  For  example,  the  Motorola 
96002  has  extended  single  precision  capability  and  the  Intel  i860  uses  a  double  precision  IEEE 
floating  point  standard. 

Operational dependencies of compound instructions (see Figure 6.5) are difficult to handle
in the DSP. This problem, due to the pipelining effect, is not unique to DSP’s, as it is also
evident in supercomputers. The key difference is that the delays are handled automatically in
supercomputers, whereas the DSP programmer is responsible for handling the pipeline constraints
due to the lack of automatic delays. Present-day DSP compilers insert conservative delays to
avoid any pipeline conflicts.

DSP’s are poor at integer multiplication by values other than 2. This deficiency applies only to the
current generation of floating point DSP’s, such as the DSP32. In the next-generation DSP’s, such
as the Motorola 96002, full integer operation capability will be available, including integer
multiplication.

DSP’s do not provide direct instructions for floating-point division, square root, and
transcendental functions. These operations must be done in software. When selecting division
algorithms we can trade accuracy for speed. In some applications this tradeoff may be acceptable and
could be used during the early phase of iteration, with higher accuracy used during the final stages.
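
One common way to perform division in software, and to trade accuracy for speed, is Newton-Raphson iteration on the reciprocal, which needs only multiplication and subtraction. The sketch below is a generic illustration and not the division routine used on the DSP32. The function names, the linear seed, and the default iteration count are assumptions of this example.

```python
import math


def reciprocal(d, iterations):
    """Approximate 1/d with Newton-Raphson steps x <- x * (2 - d * x).

    Fewer iterations run faster but give a less accurate result.
    """
    if d == 0.0:
        raise ZeroDivisionError("reciprocal of zero")
    sign = -1.0 if d < 0.0 else 1.0
    # abs(d) = mantissa * 2**exponent with 0.5 <= mantissa < 1
    mantissa, exponent = math.frexp(abs(d))
    # Linear seed for 1/mantissa; the two constants would be precomputed.
    x = 48.0 / 17.0 - (32.0 / 17.0) * mantissa
    for _ in range(iterations):
        x = x * (2.0 - mantissa * x)  # the relative error is squared each step
    return sign * math.ldexp(x, -exponent)


def divide(a, b, iterations=4):
    """Software division a / b built from the reciprocal approximation."""
    return a * reciprocal(b, iterations)


if __name__ == "__main__":
    for n in (0, 1, 2, 3, 4):
        print(n, abs(reciprocal(3.0, n) - 1.0 / 3.0))
```

Because the error is roughly squared at every step, one or two iterations may suffice in the early phase of an iterative algorithm, with more iterations reserved for the final stages.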

DSP’s are poor for string and character handling because they do not have special hardware
instructions for handling variable length bytes of information. Many of the elementary
string operations can be coded efficiently using the available integer registers. Nevertheless, the
overall program should be structured in such a way that the majority of the string operations are
performed by the host processor.

The use of DSP’s requires more complex error checking because errors can also occur in DSP
operations. Fortunately, DSP devices typically provide error flags for both the floating point and
integer processors. Thus, the host processor must only check the status of these flags. On the plus
side, this can be done as the DSP is running to provide a real-time monitor of activity.

The  DSP  normally  does  not  have  its  own  operating  system,  because  it  operates  in  a  slave 
mode.  However,  the  flexibility  and  the  range  of  available  instructions  do  permit  a  simple  independent 
operating  system  to  be  developed  if  needed. 

The  power  of  the  DSP  is  best  demonstrated  in  those  problems  where  many  iterations  are 
needed.  If  the  computation  is  relatively  short,  then  it  may  not  be  advantageous  to  download  that  part 
of  the  solution  to  the  DSP. 

9.4  Future  DSP  Developments 

We  can  expect  continuous  improvements  not  only  in  microprocessors  but  also  in  future 
coprocessors  and  DSP’s.  Faster  versions  of  the  current-generation  coprocessors  are  already  available. 
Some  of  these  versions  use  less  internal  microcoding  and  more  direct  hardware  implementation  of 
floating  point  logic  to  increase  their  speed. 

The number of floating point functions depends on the chip size. Since component yield
depends on the chip size, the tendency is to keep the chip as small as possible to obtain a
profitable yield. This in turn limits the number of different operations that can be done on the chip.
However, as chip manufacturing techniques improve and feature sizes shrink, we can expect that new
features, such as double precision or fast floating point division, will be included in the
next-generation DSP devices. Future DSP’s will also support capabilities such as IEEE-format
operations and random number generation.

As an example, an on-chip random number generation capability was to be incorporated by
Motorola in their 96002 DSP. However, due to the chip size constraints, it was not included. Should
the random number generation capability be incorporated in a future DSP instruction set, it may be
useful for high-speed statistical computations. This generator should meet the basic requirements of
random number generation, such as a long cycle length and a highly uniform distribution.

Figure 9.3 Trend in DSP computation speed versus year. The top of the line conventional
microprocessor is shown for comparison.

To gain a speed advantage in floating point computations, some of the current DSP’s use a
non-IEEE internal floating point format. Therefore, when the DSP communicates data to an
IEEE-standard environment like the PC, a floating point conversion is necessary. To reduce conversion
time, some DSP’s, such as the DSP32C, provide a single instruction cycle conversion, while future
DSP versions will probably use IEEE floating point format directly.

We  can  also  expect  that  some  new  support  operations  will  be  included,  such  as  those  needed 
for  efficient  evaluation  of  polynomials  and  spline  functions.  In  the  Motorola  96002,  approximate 
"seed"  values  for  inverse  and  square  root  floating  point  numbers  are  provided  in  a  single  instruction. 
Availability  of  these  operations  will  help  to  further  improve  floating  point  computation  speed  in  those 
applications  that  depend  on  division  and  square  root  operations. 

Besides including floating point units in the next-generation PC-compatible microprocessors,
there are several other microprocessor designs that afford significant speed advantages. These
include the reduced instruction set computer (RISC) chips. One such RISC chip is the Intel i860.
Not only does it have a high throughput, but it also supports some of the DSP operations, as well
as graphical operations. This processor has internal pipelining similar to that found in the DSP’s.
It can also support some parallel processing, again similar to that found in the DSP’s. The i860
appears to be a good candidate for future statistics workstation expansion.

9.5  Statistical  Algorithm  Performance  Evaluation 

Statistical algorithm selection for performance evaluation was based on two factors. First,
these algorithms had to be computation-intensive. Second, the selected algorithms had to have
wide applicability in modern statistical computations. A detailed description of the selected
algorithms and performance results was presented in Section 8.

The  selection  of  performance  criteria  is  not  an  easy  task.  On  one  hand,  the  selected  criteria 
should  be  simple  and  easily  understandable.  On  the  other  hand,  the  selected  criteria  must  lead  to 
an  objective  evaluation  of  the  system’s  true  capabilities. 

The  performance  factors  selected  for  this  study  included  speed,  accuracy,  and  the  size  of  the 
problem.  The  program  optimization  can  affect  all  of  these  factors.  The  initial  performance  measures 
considered  ranged  from  a  simple  measure  expressing  floating  point  operation  timing  to  a  more 
complex  measure  based  on  simulation  results. 

The advantages and disadvantages of several performance evaluation methods are discussed
below.


9.5.1  Analytical  Approach 

FLOPS. Floating point operations per second is the simplest processor performance measure.
Although the FLOPS rating is one of the key factors used in advertising the available DSP
capabilities, the rating is not always a meaningful criterion when applied to a statistical problem¹.
We conclude that the only meaningful use for the FLOPS rating is as an upper limit indicating the
potential peak performance that cannot be exceeded regardless of program optimization.

Computational Complexity. One approach to performance evaluation is based on theoretical
computations. In the past, a reasonably accurate prediction of performance could be made from the
number of instructions, such as additions and multiplications. However, this approach is particularly
difficult to use when evaluating the statistics workstation performance because of the complex
structure and behavior of the DSP. This is due to the number of compound operations, such as
multiply-accumulate.

However, it may be possible to develop a set of approximate relations which could be used
for preliminary evaluation. These relations could include such factors as the number of divisions,
function calls, and other operations which carry a substantial overhead in DSP operation.

Simulation  Approach.  Via  simulation,  one  could  measure  performance  by  determining  the 
number  of  instructions,  nops,  wait  states,  etc.  This  is  done  by  obtaining  a  profile  of  the  program.  The 
available  software-based  DSP  simulator  (supplied  by  AT&T  as  a  part  of  the  DSP32  applications 
software  library)  provides  such  a  timing  profile  for  the  program.  It  not  only  permits  a  detailed  view 
of  the  DSP  operation  but  also  provides  all  the  information  needed  to  evaluate  performance  of  the 
low-level  subroutines.  For  example,  the  software  simulator  permits  easy  determination  of  the  number 
of  wait  states  introduced  as  a  result  of  memory  conflicts. 

Low-level  Performance  Evaluation.  Another  performance  measure  could  be  based  on  speed 
improvement  in  the  low  level  algebraic  and  statistical  routines  and  would  use  actual  computation  time. 
This  type  of  measure  is  easy  to  obtain.  However,  due  to  the  system  overhead  a  simple  relationship 
does  not  exist  between  the  low-level  performance  and  the  speed  improvement  at  the  system  level. 

Emulation.  Another  approach  to  simulation  involves  using  the  DSP  hardware-based  emulator 
to  perform  an  actual  real-time  speed  comparison.  However,  there  is  a  slight  overhead  penalty 
associated  with  using  the  hardware  emulator  due  to  the  use  of  breakpoints.  If  only  a  relative  speed 
comparison  is  desired,  then  this  overhead  is  not  a  problem. 

9.5.2  System-level  Comparison 

System or application-level comparison involves measuring computation times at the program
level and is most representative of the actual workstation capabilities. This measure provides the best
performance criterion because the user is normally interested in the total program running time. This
approach involves developing and evaluating two different programs. One of the programs uses only
host-based processing; the other uses DSP support. The specific modes of operation are:

Pure host operation. The pure host mode of operation represents the conventional approach
to statistical computing. In this mode all operations are performed in the host. The measured time
represents the baseline for performance improvement evaluation.

Master and slave operation. When the DSP is used in the slave mode, all of the I/O
operations are initiated by the host processor. Data transfer was included in the performance
measure. However, for the cases tested, the transfer time was a negligible portion of the total due
to the DMA transfer capabilities.

¹ For example, DSP marketing announcements often assume that the only operations that are performed are
the MAC operations or some even more complex floating point operation. Thus, in the Motorola 96002
announcement, the peak FLOPS rating assumes concurrent operation of multiplication, addition, and subtraction
(Appendix D). This means that in one instruction cycle, 3 floating point operations can be performed. Since these
capabilities are seldom needed in conventional computations, the peak rating is quite misleading.


Figure 9.4 Overall timing performance of DSP-based algorithms compared to conventional
microprocessor. (Horizontal axis: algorithm. Legend: DSP BSAS, DSP hand, DSP comp, 386/287, 386.)


The  evaluation  results  are  given  in  Figure  9.4.  Abbreviated  names  for  the  programs  evaluated 
in  Section  8  are  given  along  the  axis.  Reading  the  legend  from  top  to  bottom,  the  modes  of 
computation  are  (■)  DSP  compiled  with  BLAS/BSAS  routines,  (-»-)  DSP  hand  optimized  code,  (*) 
DSP  compiled  with  no  BLAS/BSAS  routines,  (□)  PC  operation  with  386  host  processor  and  287 
coprocessor,  and  (x)  386  processor  alone  (this  is  the  baseline  of  unity  speed).  One  can  see  that  the 
low-level  BLAS/BSAS  routines  are  effective  in  increasing  the  performance.  In  addition,  the  DSP 
performance  is  very  sensitive  to  the  type  of  algorithm.  Whereas  the  numeric  coprocessor  gives  a 
uniform  speedup  across  the  applications,  the  DSP  depends  on  the  amount  of  array  processing  needed. 
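
The speedups relative to the 386 baseline can be recomputed directly from the timing tables of Section 8. The short Python sketch below does this for Tables 8.15 through 8.19. It is only a worked arithmetic example: the dictionary layout and names are hypothetical, and the 386 entry of Table 8.16 is read as 63 s.

```python
# Running times in seconds, taken from Tables 8.15 to 8.19.
TIMINGS = {
    "Bayesian bootstrap (Table 8.15)": {"386": 64.3, "386/287": 14.5, "DSP32": 1.53},
    "PNN (Table 8.16)": {"386": 63.0, "386/287": 16.3, "DSP32": 0.66},
    "Euclidean distance (Table 8.17)": {"386": 137.0, "386/287": 31.1, "DSP32": 2.75},
    "Formula parsing (Table 8.18)": {"386": 45.0, "386/287": 15.0, "DSP32": 1.4},
    "Shuffle statistic (Table 8.19)": {"386": 549.0, "386/287": 125.0, "DSP32": 6.66},
}


def speedups(timings, baseline="386"):
    """Speedup of each processor relative to the baseline (baseline = 1)."""
    return {
        name: {proc: row[baseline] / seconds for proc, seconds in row.items()}
        for name, row in timings.items()
    }


if __name__ == "__main__":
    for name, row in speedups(TIMINGS).items():
        cells = ", ".join(f"{proc}: {value:.1f}x" for proc, value in row.items())
        print(f"{name}: {cells}")
```

For these five programs the coprocessor speedup stays between 3 and about 4.4, while the DSP32 speedup ranges from about 32 to about 95, which matches the observation that the DSP is far more sensitive to the type of algorithm.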
9.6 Cost Evaluation


This  section  presents  a  preliminary  cost  estimate  for  the  statistics  workstation  hardware  and 
software  development.  The  hardware  cost  estimate  is  given  for  both  the  workstation  development 
system  and  for  the  hardware.  The  software  development  costs  will  include  the  planned  Phase  II 
development  effort. 

Pricing of the system will be determined by the market. The initial emphasis should be on
research departments within universities. This means that the cost should be affordable, with a PC
and not a Unix-type workstation providing the initial platform.

Cost Summary

PC                          DSP
Basic system - $2000        Board  - $1000
Memory       - $1000        Memory - $1000
Hard Disk    - $500

The  specifics  are  presented  below. 

9.6.1  Host  Cost 

Since there are no unique requirements for the statistics workstation host, any standard 286
or 386-type system could be selected. For baseline comparison we selected a 20 MHz 386-type
system on the basis of both price and speed.

9.6.2 DSP Board Cost

DSP  board  cost  will  depend  on  the  selected  DSP  type,  and  speed  and  memory  requirements. 
The  two  choices  are  to  use  a  commercially  available  board  or  to  develop  a  custom  DSP  board. 

In  our  Phase  I  effort  we  used  a  commercially  available  DSP32  board.  This  choice  of  the 
board  was  made  to  reduce  the  project  costs,  but  still  provide  sufficient  capability  to  evaluate  the 
expected  workstation  performance. 

Since  the  DSP  board  configuration  is  critical  to  the  success  of  the  statistics  workstation,  the 
advantages  and  disadvantages  of  using  a  commercial  board  will  be  reviewed  during  the  Phase  II  effort. 

Using a commercially available board. A detailed description of the commercially available
plug-in DSP boards is given in [EDN April 26, 1990]. A total of 23 companies are either supplying
DSP plug-in boards or marketing DSP support software. Of the 50 different models that are
available, 22 boards use floating point DSP’s (AT&T DSP32, DSP32C, and TI 320C30). The majority
of these boards have been designed to support analog interfaces and as such are more expensive.

There are many advantages in using a commercial DSP board. These include elimination of
hardware design costs and associated risks because these are borne by the board vendor. This in turn
implies that software development testing using actual hardware can begin earlier because the need
for designing, development testing, and manufacturing is eliminated.
On the other hand, there are also some disadvantages in using a commercial board. These
include lack of flexibility, lack of special features needed for efficient interfacing, and the dependence
on an external vendor. Since the DSP board is from an external source, a higher price is to be
expected. This premium in price may not be excessive because a commercial board is used by a wider
market. However, quantity discounts on commercial boards are not very high, unless very large
quantities of boards are purchased.

Using  an  internally  developed  DSP  board.  Development  of  a  customized  DSP  board  could 
provide  the  best  speed  improvement  because  the  interface  could  be  tailored  to  support  the  unique 
workstation  data  transfer  requirements  and  to  provide  interfacing  for  multi-DSP  applications. 
However,  the  development  costs  can  be  relatively  high  for  the  initial  design  because  this  effort 
involves  board  design,  printed  circuit  artwork  generation,  board  manufacturing,  and  assembly.  These 
higher  initial  costs  could  be  justified  only  if  a  larger  quantity  of  boards  can  be  sold  and  the  board 
design  does  not  have  to  undergo  major  changes.  Another  advantage  of  using  a  customized  board  is 
the  retention  of  the  proprietary  aspects  of  the  statistics  workstation  design. 

Thus,  to  reduce  the  risk,  custom  board  design  should  not  be  undertaken  before  the 
workstation  design  has  been  frozen,  and  the  potential  market  share  determined. 

9.6.3 System Software Cost

Costing for the software is broken into three parts: (1) development cost, (2) basic statistical
routines cost, and (3) user tools cost. The first cost is borne by the developer, while the latter two
are market driven. Since software cost prediction is not reliable, only (3) will be described in detail.

The  basic  statistics  workstation  software  will  include  routines  that  perform  many  of  the 
computations  described  in  Section  8.  However,  some  of  the  users  may  desire  to  develop  their  own 
special  versions  of  the  program.  This  means  that  the  user  will  not  only  need  access  to  the  original 
program  source  but  will  also  need  the  two  C  compilers  to  support  both  host  and  DSP  programming. 

DSP  assembler.  A  standard  DSP  assembler  is  sufficient  and  most  appropriate  for  low-level 
module  development  and  optimization.  Fortunately,  the  AT&T  DSP32  assembly  level  coding  is 
relatively  easy  to  learn  because  of  the  higher-level  instructions  which  are  available  in  the  DSP. 
However,  there  are  programming  difficulties  due  to  the  pipeline  effects  and  other  DSP  architecture 
imposed  instruction  constraints. 

DSP  C  compiler.  A  C  compiler  is  needed  to  support  more  complex  program  development. 
The  key  advantage  of  using  a  DSP  C  compiler  is  in  the  reduction  of  the  program  development  time. 
The  use  of  the  C  compiler  also  permits  parallel  program  development. 

Macro generator. A macro generator provides another approach to decreasing program
development time. Although macro generators are effective in providing substantial
improvement in program development time, they are not widely used. They are also particularly
suitable in those situations where a well-defined high-level problem description exists*.

During  the  Phase  I  development  effort  the  use  of  STAGE2  [Waite  1973],  a  more  capable 
macro  generator,  was  investigated.  In  particular,  the  STAGE2  macro  generator  was  used  for 
automatically  generating  a  number  of  DSP  interface  programs  using  a  high-level  program  description. 
Although its use did not directly affect the speed of the resulting programs, it did reduce benchmark
program  development  time  and  the  need  for  debugging.  The  automatic  generation  of  the  interface 
programs  almost  completely  eliminated  typing  errors  during  the  coding  phase. 

Although the STAGE2 macro generator was used during the initial development effort, a
custom  program  for  generating  the  same  interface  could  be  written  in  a  higher  level  language  (such 
as  AWK  or  YACC  from  UNIX). 

9.7  Risk  Analysis 

Although  there  is  risk  associated  with  the  development  of  some  of  the  algorithms,  a  significant 
payoff  can  be  expected  because  the  concept  has  been  proven  feasible.  The  single-chip  DSP  devices 
are  widely  available  and  their  use  is  expected  to  grow  at  a  30%  annual  rate.  In  1989  the  single-chip 
DSP  market  had  already  reached  $1  billion  in  sales  (Computer  Design,  May  1,  1990). 

We can also expect further developments in DSP applications and the availability of new and
even more powerful DSP devices in the future. At the same time, the cost of the DSP devices is
expected to decrease.

The availability of the next-generation DSP devices will not make the present statistics
workstation algorithm development obsolete because the basic operations of the DSP, such as MAC, will not
change. In fact, the new features that have been promised for the next-generation DSP’s will help
to further optimize the algorithms. More powerful languages, such as Numeric C, which support
mathematical operations on vectors and arrays, may also be standardized for DSP use.

Thus,  we  can  expect  that  the  DSP  devices  will  retain  their  statistical  computation  speed 
advantages  over  the  conventional  microprocessors  (Figure  9.3)  in  the  future. 


* Since the DSP interface programs are relatively complex in structure, a simple macro processor, such as the
one included as a preprocessor to the AT&T DSP assembler and the C compiler, is not suitable for the automatic
generation of interface programs.




10  CONCLUSIONS  AND  RECOMMENDATIONS 


This section summarizes the Phase I feasibility investigation results and presents recommendations
for the Phase II statistics workstation development effort.

10.1  Conclusions 


10.1.1  Feasibility  of  Statistics  Workstation 

The results of the Phase I effort fully substantiated our initial projections for a DSP-supported
statistical computing environment. No major obstacles to speed improvement using the DSP were
discovered. In fact, the majority of the statistical algorithms tested were easily modified to provide
substantial improvements. In those cases where major improvements were not achieved, the
difference was due to operations that were not optimal with respect to the DSP instruction set.

The hardware and software interfaces between the host processor and the DSP were found
to be relatively simple and did not present any major implementation problems. With the planned
use of the DSP32C in the next phase, the interface efficiency will be further improved. This
improvement will be due to a number of factors, such as faster clock speed, use of a 16-bit parallel
interface, and faster IEEE floating point format conversion.

A cost analysis for the DSP-based statistics workstation was presented in Section 9. This
analysis showed that the initially proposed cost objectives can be easily met.

10.1.2  Suitability  of  DSP  in  Specific  Applications 

It is our conclusion that a DSP-supported system is highly suitable for most of the
computation-intensive statistical problems, particularly those which can use the unique processing
capabilities of the DSP, such as MAC and address autoincrementing. Performance depends strongly
on a favorable DSP instruction mix, that is, a high ratio of MAC operations combined with array indexing.
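As an illustration of a favorable instruction mix, the following C sketch shows a strided inner product whose loop body is a single multiply-accumulate plus two index increments. The routine name sdot_sketch and its argument layout are assumptions patterned on the Appendix A routines; it is not one of the prototyped subroutines.

```c
#include <assert.h>

/* Hypothetical example routine (not part of the prototyped library):
   strided single precision inner product.  Each pass through the loop
   is one multiply-accumulate and two index increments, which is the
   kind of work a DSP MAC instruction with address autoincrementing
   is designed for. */
float sdot_sketch(int N, const float SX[], int INCX,
                  const float SY[], int INCY)
{
    float acc = 0.0f;       /* accumulator */
    int i, j, n;

    assert(N >= 0 && INCX > 0 && INCY > 0);

    for (i = j = n = 0; n < N; i += INCX, j += INCY, n++)
        acc += SX[i] * SY[j];   /* multiply and accumulate */

    return acc;
}
```

A loop of this shape contains no tests other than the loop counter, so nearly every instruction issued does useful arithmetic.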

In  those  applications  which  involve  many  conditional  checking  and  branching  operations,  the 
additional  pipeline  overhead  can  substantially  reduce  the  potential  gain.  It  is,  however,  possible  to 
compensate  for  this  loss  by  careful  optimization  in  the  inner  loops  and  by  using  conditional 
accumulator  load  instructions  which  do  not  result  in  pipeline  delays. 
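The compensation described above can be pictured in C as replacing an if/goto inside the inner loop with a conditional assignment. The sketch below is only an analogy under stated assumptions: the routine name smax_sketch is hypothetical, and whether a given compiler turns the conditional expression into a branch-free conditional load depends on the compiler and target.

```c
#include <assert.h>

/* Hypothetical example routine: largest element of a strided array.
   The selection is written as a conditional assignment rather than
   as a branch around the update, which is the form that corresponds
   to a conditional accumulator load on the DSP. */
float smax_sketch(int N, const float SX[], int INCX)
{
    float best, x;
    int i, n;

    assert(N > 0 && INCX > 0);

    best = SX[0];
    for (i = INCX, n = 1; n < N; i += INCX, n++) {
        x = SX[i];
        best = (x > best) ? x : best;   /* select, no goto */
    }
    return best;
}
```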

The single precision floating point normally limits the accuracy of the solution. However, the
use of single precision can hold down the system cost because memory requirements for data storage
are reduced by one half over the double precision case. A number of techniques, such as data
centering, are available to minimize the effects of the single precision accuracy limitations. In some
applications, such as exploratory data analysis where speed improvement and cost are of major
importance, single precision operation may be preferable.
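The effect of data centering can be sketched in C by comparing two ways of computing a sample variance in single precision. Both routine names are hypothetical and the code is a minimal illustration, not the workstation implementation.

```c
#include <assert.h>

/* One-pass textbook formula on raw (uncentered) data:
   (sum(x^2) - N*mean^2) / (N-1).  In single precision the two large
   terms can cancel and leave few or no correct digits. */
float svar_uncentered(int N, const float SX[])
{
    float sum = 0.0f, sumsq = 0.0f, mean;
    int n;

    assert(N > 1);
    for (n = 0; n < N; n++) {
        sum   += SX[n];
        sumsq += SX[n] * SX[n];
    }
    mean = sum / (float)N;
    return (sumsq - (float)N * mean * mean) / (float)(N - 1);
}

/* Two-pass form with data centering: subtract the mean first, then
   accumulate squares of the (small) deviations. */
float svar_centered(int N, const float SX[])
{
    float sum = 0.0f, ss = 0.0f, mean, d;
    int n;

    assert(N > 1);
    for (n = 0; n < N; n++)
        sum += SX[n];
    mean = sum / (float)N;
    for (n = 0; n < N; n++) {
        d = SX[n] - mean;
        ss += d * d;
    }
    return ss / (float)(N - 1);
}
```

For data such as 10000, 10001, 10002 the centered form works with the deviations -1, 0, 1, which are exactly representable, and returns a sample variance of 1. The uncentered form must difference two quantities near 3 x 10^8, where adjacent single precision numbers are 32 apart, so most or all of the significant digits can be lost.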

10.1.3 Performance and Cost


The improvement in computation speed using the DSP can be expressed as a simple ratio
between  the  conventional  approach  and  the  DSP-based  approach.  This  improvement,  however,  will 
be  dependent  on  problem  type,  size,  and  complexity.  The  use  of  a  numerical  coprocessor  in 
conjunction  with  the  host  resulted  in  only  a  factor  of  4  (our  measure)  to  20  (highest  performance 
coprocessor  available)  speedup  over  the  host  alone.  However,  since  the  numerical  processors  cannot 
operate  independently,  parallelism  is  never  achieved. 

The  use  of  a  DSP  in  the  majority  of  situations  resulted  in  a  major  improvement  over  the 
numeric  coprocessor,  as  shown  in  Figure  9.4. 

Based on the processing speed and memory requirements, the hardware cost and
cost/performance ratio were found to be within the initial projections. For the rather low-cost and
low-speed DSP we evaluated, the cost/performance was estimated to be greater than 1 MFLOPS/$1K.
This  is  near  the  projection  originally  set  by  Figure  5.1.  The  hardware  cost  estimate  included  the  DSP 
board  and  memory  cost  along  with  a  readily  available  PC.  Not  included  in  this  cost  estimate  was  the 
supporting  software  such  as  DSP  and  PC  assemblers,  compilers,  and  linkers.  This  is  typically  not 
included  in  the  workstation  hardware  cost. 

10.1.4 Anticipated Benefits

As shown in this feasibility study, the goal of high-speed and low-cost statistical computation
can be achieved by emphasizing an applications-oriented approach, using the specialized DSP
architecture, and optimizing low-level algorithms to provide for high performance statistical
computation building blocks.

The  successful  completion  of  all  phases  of  the  statistics  workstation  project  will  provide 
affordable  processing  power  to  those  researchers  involved  with  computation-intensive  statistical  tasks. 
Not  only  will  they  have  more  time  available  for  productive  research,  but  they  also  will  be  able  to 
investigate some of the advanced techniques that are now cost prohibitive. The economic benefits
will  be  reduced  research  costs. 

10.2  Potential  Statistics  Workstation  Applications 

The  availability  of  low  cost  equipment  will  be  of  interest  to  university  computer  support  and 
instrumentation  support  programs,  as  well  as  commercial  enterprises  that  do  extensive  statistical 
analysis.  Applications  include  research  laboratories,  quality  control,  and  financial  concerns. 
Specialized  application  areas  include  manufacturing  systems,  supercomputing,  and  scientific 
instrumentation.  As  an  example,  statistical  analyses  of  processes  and  defects  are  important  design 
considerations for monitoring instrumentation that will be used on a manufacturing line [Pukite 1990].

Highly complex problems also exist in other fields, such as physics, medicine, and engineering.
In these fields, cost-effective solutions to complex problems have been achieved
by developing specialized computers [Fox 1988, Alder 1988]. The speed and
cost advantage in these applications is realized by narrowing the range of the computations and by
using specialized hardware to handle the highly regular and repetitive tasks. Examples of this
approach include commercially available logic simulation and layout accelerators for the
semiconductor industry.

In  developing  the  statistics  workstation  we  are  following  a  similar  approach.  We  can  also 
expect  that  statistical  techniques  that  were  not  used  previously  due  to  their  reliance  on  extensive 
numerical  computation  may  become  routine  with  the  high  speed  computing  capability  available  in  the 
statistics workstation. Thus, the workstation users will be able to obtain a low cost and high speed
solution  to  their  statistical  problems. 

10.2.1 Expected Future Improvements

Although we can predict with certainty that improved DSP devices will be available in the
future, it is difficult to predict the expected improvements in performance and reduction in cost.
There are, however, two processors which could have a major impact on statistical computations: the
Intel 860 and the Motorola 96002.

The  Intel  860  is  not  a  true  DSP  device  although  it  does  have  many  of  the  features  of  a  DSP. 
It  does  have  some  distinct  advantages  over  the  present  DSP  devices  in  that  it  supports  double 
precision  computations  and  includes  some  graphics  operations  for  three-dimensional  displays. 

Although  the  Motorola  DSP  is  not  yet  in  full  production,  development  software  is  already 
available.  Based  on  the  number  of  features  of  this  device  and  its  precision  and  speed,  it  should 
improve  the  performance  of  the  statistics  workstation.  However,  the  lack  of  operational  hardware 
has  prevented  a  full  performance  evaluation. 

Multi-DSP concept. Using several DSP’s in a single workstation has the potential of further
improving  computation  speed.  The  cost  advantage  results  from  the  cost  of  the  DSP  being  a  nonlinear 
function  of  its  speed.  For  example,  when  the  processor  speed  is  doubled,  the  added  integrated  circuit 
engineering  design  and  production  cost  may  increase  by  a  factor  of  four  or  more. 

Even though the multi-DSP concept was not evaluated in detail for this phase, several
application areas are worth noting. The best problems for multi-DSP applications will be those that
lend themselves to easy partitioning of computational tasks, such as bootstrapping or Monte Carlo
analysis. The DSP32 design permits easy implementation of clustering of individual processors using
the serial interface. A number of multiprocessor DSP32 implementations have been described in
AT&T application notes [AT&T 1988].
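The reason bootstrapping partitions so easily can be sketched in C: each replicate is independent, so a block of replicates can be handed to each processor with no communication until the results are collected. The function names, the seeding scheme, and the small linear congruential generator below are all illustrative assumptions; the multi-DSP configuration itself was not implemented in Phase I.

```c
#include <assert.h>

/* Split B independent replicates among P workers as evenly as
   possible.  Worker p receives replicates first .. first+count-1. */
void replicate_range(int B, int P, int p, int *first, int *count)
{
    int base, extra;

    assert(B >= 0 && P > 0 && p >= 0 && p < P);
    base  = B / P;
    extra = B % P;
    *count = base + (p < extra ? 1 : 0);
    *first = p * base + (p < extra ? p : extra);
}

/* Small linear congruential generator returning 15 bits.
   Illustrative only; it has not been statistically vetted. */
static unsigned long lcg_next(unsigned long *state)
{
    *state = (*state * 1103515245UL + 12345UL) & 0x7fffffffUL;
    return (*state >> 16) & 0x7fffUL;
}

/* Work done by one worker: bootstrap means of x[0..n-1] for its own
   share of the B replicates, written into its slice of out[]. */
void bootstrap_means_worker(const float x[], int n, int B, int P, int p,
                            float out[])
{
    int first, count, b, k;
    unsigned long state = 12345UL + 7919UL * (unsigned long)p;
    float sum;

    assert(n > 0 && n <= 32768);
    replicate_range(B, P, p, &first, &count);

    for (b = first; b < first + count; b++) {
        sum = 0.0f;
        for (k = 0; k < n; k++)           /* resample with replacement */
            sum += x[lcg_next(&state) % (unsigned long)n];
        out[b] = sum / (float)n;
    }
}
```

On a single processor the workers can simply be called in turn for p = 0, ..., P-1; in a multi-DSP arrangement each call would run on its own device and only the out[] slices would be transferred back to the host.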

10.3  Recommendations 

Specific recommendations are made in this section regarding the prototyping of a more
advanced statistics workstation configuration and further statistical software evaluation and
optimization. The proposed Phase II effort will build on the success of the Phase I feasibility
demonstration and will result in an operational prototype of the statistics workstation.

10.3.1 Selected Statistics Workstation System Configuration

Although the basic workstation architecture will not change during Phase II, several new
features and improvements will be added. These will include a dynamic memory allocator for DSP
memory management and a computation task manager. The computation task manager will handle
computation task assignment to either the PC or the DSP’s. The manager will also balance the computing
load between the PC and the DSP’s.

Since we can expect not only new DSP devices to appear in the near future, but also potential
reductions in device prices, the status of the DSP technology will be reviewed at the start of the
Phase II effort. The investigation will include evaluation of the Intel 860 and Motorola 96002
devices, which should be available by the start of that effort.

10.3.2 Statistics Workstation Hardware Recommendations

The statistics workstation hardware must be optimized to provide a cost-effective solution
to  the  computation-intensive  statistical  problems.  A  brief  discussion  of  the  recommended  hardware 
for  a  prototype  demonstration  is  presented  below. 

Host system. Any higher performance 80x86-type system (such as the 386) would provide a
stable  platform  for  prototyping. 

Global Memory. Since memory costs are continuing to decrease, no extra effort was
made to reduce overall memory requirements. The specific memory size and speed requirements
will depend to a great extent on the specific application. Two extreme application cases can be
identified. In the first case we are presented with a limited amount of input data, but many
computations  are  needed.  In  the  second  case  we  have  a  large  amount  of  data,  but  only  a  limited 
number  of  computations.  The  first  case  is  more  suitable  for  DSP  processing,  whereas  the  second  case 
can  be  handled  by  the  host  processor  which  has  fast  access  to  a  hard  disk.  Since  most  of  the  actual 
problems  will  lie  between  these  two  extremes,  memory  will  have  to  be  sized  and  chosen  (either 
SRAM,  DRAM,  or  disk  storage)  according  to  the  application. 

Disk  Storage.  Hard  disk  storage  is  important  for  fast  operation,  as  any  delay  in  data  access 
will  negate  the  speed  of  the  computations  performed  by  the  DSP. 

Input  Devices.  A  keyboard  with  mouse  support  provides  the  standard  interface. 

Display.  Most  of  the  statistical  display  needs  can  be  met  with  conventional  color  display  cards 
(e.g.  VGA  format).  If  more  data  points  or  a  more  sophisticated  graphical  display  is  required,  then 
a  graphics  coprocessor  may  be  needed  to  handle  the  higher  display  resolution  and  the  increased 
display  processing  load.  Therefore,  the  use  and  direct  interface  of  a  graphics  processor  to  the  host- 
DSP  system  should  be  investigated.  For  most  statistical  applications  and  particularly  exploratory  data 
analysis,  a  high-resolution  color  display  is  preferable  over  a  monochrome  monitor.  Factors  involved 
in  the  selection  of  the  display  monitor  include  resolution  and  monitor  size. 

Digital Signal Processor. A wide variety of choices exist. Since DSP devices differ in their
instruction sets and capabilities, a standard approach for incorporating various DSP devices is not
feasible. The majority of conventional DSP plug-in boards have been designed to interface with
analog signals. Designing a custom DSP board would enable more functionality to be added to the
board. It is particularly important to add multiple DSP’s to further speed up the statistics workstation
operation. Adding multiple DSP’s on a single board is a more cost-effective solution than adding
additional boards.

10.3.3 Statistics Workstation Software Recommendations

The statistical software prototyped in the Phase I effort was limited to a number of select
modules needed to demonstrate the feasibility of the concept. Several extensions to this foundation
will be needed before the statistics workstation becomes a viable system. This means that a
comprehensive  set  of  solution  techniques  will  have  to  be  developed  and  included  in  the  basic 
prototype  support  package.  The  planned  statistics  application  workstation  software  will  consist  of  a 
systems  manager,  user  and  language  interface,  utility  programs,  and  a  statistical  routine  library. 

Systems  manager.  The  statistics  workstation  system  manager  program  will  support  the  user 
interface and control all computing tasks. The user interface support will include help and control
menus, input and output device drivers, database, and edit functions. The control functions will
handle  task  priority  assignments,  task  sequencing,  task  scheduling,  and  task  monitoring.  In  addition, 
the  manager  will  handle  DSP  program  and  data  transfer. 

During  the  Phase  I  effort,  emphasis  was  on  demonstrating  the  feasibility  of  the  proposed 
approach  and  evaluating  the  potential  speed  improvement  in  statistical  computations.  As  a  result, 
the  developed  software  had  a  rather  elementary  interface.  During  the  proposed  Phase  II  effort,  more 
emphasis should be placed on choosing an efficient, interactive, and user friendly interface.

Error  recovery.  A  thorough  error  handling  module  will  be  developed  and  added  to  the  statistics 
workstation.  This  module  will  perform  extensive  data  integrity  checking  and  will  help  to  recover  in 
case of an error. Since computation-intensive programs may require long running times, even when
using hardware accelerators, a running time estimator module is essential.

Utility  programs.  Because  of  the  widespread  use  of  other  statistics  packages,  support  tools 
include  data  translation  programs  needed  for  importing  and  exporting  data  between  the  applications. 

Graphics.  Besides  the  interactive  display  software  required,  graphics  support  should  also 
include  development  of  graphical  output  reports  as  well  as  interfaces  to  presentation  packages.  In 
addition  to  these  common  interfaces  (such  as  EGA  and  VGA),  selected  higher  resolution  graphics 
boards  (super-VGA)  will  provide  extended  visualization  capability.  The  specific  support  included  will 
depend  on  the  commercial  availability  and  software  for  the  appropriate  graphics  drivers. 

Statistical  application  software.  Statistical  applications  modules  will  use  the  DSP  support  only 
where  needed  to  reduce  the  amount  of  data  transfer  and  high  speed  memory.  The  initial  application 
modules will include those that were identified and investigated as part of this effort. In addition,
a  problem-oriented  language  support  will  be  provided  for  solving  user-defined  problems.  An  online 
help  module  will  display  particular  solution  techniques  available  in  the  system. 

Statistical language interface. In the S language, as well as the IML language provided by SAS,
the basic computing modules that perform the majority of computations are compiled and available
in a library. An online interpreter processes the user commands and calls the specific subroutines.
Since most of the computations are performed by the compiled modules, the overhead introduced
by the interpreter will not be substantial.
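The division of labor between interpreter and compiled library can be sketched in C as a command table that maps user command names onto compiled routines. All names below (stat_routine, command_table, interpret, and the two sample routines) are hypothetical; this is not the S or SAS/IML mechanism, only a minimal picture of why the interpreter overhead is one lookup per command while the work is done in compiled code.

```c
#include <assert.h>
#include <string.h>

/* Signature shared by the compiled routines in this sketch. */
typedef float (*stat_routine)(int N, const float SX[]);

static float stat_sum(int N, const float SX[])
{
    float s = 0.0f;
    int n;

    for (n = 0; n < N; n++)
        s += SX[n];
    return s;
}

static float stat_mean(int N, const float SX[])
{
    assert(N > 0);
    return stat_sum(N, SX) / (float)N;
}

/* Command table: the "library" as seen by the interpreter. */
static const struct {
    const char  *name;
    stat_routine routine;
} command_table[] = {
    { "sum",  stat_sum  },
    { "mean", stat_mean }
};

/* Look up one command and call the compiled routine.
   Returns 1 on success, 0 if the command is unknown. */
int interpret(const char *command, int N, const float SX[], float *result)
{
    size_t k;
    size_t count = sizeof command_table / sizeof command_table[0];

    for (k = 0; k < count; k++) {
        if (strcmp(command, command_table[k].name) == 0) {
            *result = command_table[k].routine(N, SX);
            return 1;
        }
    }
    return 0;
}
```

In a DSP-based version of this arrangement, the table entries would presumably refer to routines that transfer data to the DSP and start a DSP executable, but that step is outside the scope of this sketch.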

During the Phase II effort a similar approach will be investigated. Since all of the key
lower-level subroutines were already prototyped during Phase I, the Phase II effort will concentrate on the
intermediate and top-level software requirements. Creating a statistical language for a DSP-based
workstation will require a carefully laid out plan that considers both the PC’s and the DSP’s strengths and
weaknesses.

Application  development  system.  For  the  sophisticated  user,  the  statistics  workstation 
development  system  is  a  set  of  software  tools  for  developing  DSP-supported  statistical  application 
programs.  This  system  generates  C-language  functions  and  data  structures  for  interfacing  the  DSP 
with  the  PC.  The  functions  and  procedures  are  then  linked  with  the  user’s  application  code  to  form 
the  final  program. 

This  package  gives  the  statistical  program  developer  the  capability  to  integrate  DSP-based 
modules  into  new  or  existing  applications  programs.  The  development  system  consists  of  DSP  C 
compiler,  library  of  low-level  statistics  subroutines,  PC  -  DSP  interface  utilities,  software  description 
language  interface  generator,  and  multitasking  and  graphics  support  software. 


The elements of the support software are summarized below.

Applications System    Applications interface (manager, graphics, ..)
                       Statistical language command interpreter
                       Statistical software DSP executable files

Development System     DSP C compiler (including assembler, linker, ..)
                       PC C compiler (including assembler, linker, ..)
                       Library of low-level statistics subroutines
                       PC - DSP interface utilities
                       Software description language interface generator
                       Multitasking and graphics software

A  Appendix - Glossary of Basic Statistical Subroutines


This appendix describes the low-level statistics and linear algebra routines (BSAS and BLAS)
that have been prototyped in this study. For each routine, both the C code (Turbo C version 2.0)
and DSP assembler code (AT&T DSP32*) are given. Both of these code versions can be compiled
and linked with the compatible C compiler (AT&T DSP C compiler version 1.3.3) for use on the
DSP. The subroutines can then be used to drive higher-level applications. The C code can also
be run on the PC alone for initial prototyping and debugging. Differences in performance
between DSP and PC implementations were most often determined at the level of these routines.

Notation 

The notation S<name>, in for example SCOPY, indicates single precision. The label
SCOPY: indicates the starting memory location of the subroutine. Within the DSP routines, the
inner loop return addresses end with I: (for example routine ABSDEV has label absdevi:). To set the
increments to 4 bytes (size of single-precision floating point number), two consecutive multiply by 2
instructions are performed (i.e. *r4++r15 increments array by r15, which is a multiple of 4).
Protected registers in the DSP and compiler for function calls are the following:

r18 holds the return address from the subroutine.
r19 is for incrementing the stack and its value is set to 4 for floating point.
r14 is the stack pointer for passed arguments.

Labels such as A_ZERO store global data (in this case, the value 0.0). Return float values are stored
in a0 and return integer values are stored in r1.
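The conversion from an element increment to a byte increment can be mirrored in C as follows. This is a sketch under the assumption of a 4-byte float; the function names are hypothetical and the code only imitates the addressing arithmetic of the DSP routines.

```c
#include <assert.h>
#include <string.h>

/* Convert an element increment to a byte increment with two
   doublings, mirroring the pair of "multiply by 2" instructions
   applied to the increment register.  Assumes a 4-byte float. */
int float_byte_stride(int inc)
{
    assert(sizeof(float) == 4);
    inc = inc * 2;
    inc = inc * 2;
    return inc;
}

/* Read the n-th strided element through a byte address, the way a
   post-incremented pointer register would reach it. */
float strided_get(const float SX[], int INCX, int n)
{
    const unsigned char *base = (const unsigned char *)SX;
    float value;

    memcpy(&value, base + (long)n * float_byte_stride(INCX), sizeof value);
    return value;
}
```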


* The DSP32C version has also been coded but is not shown for proprietary and space limitation reasons.

ABSDEV


Prototype :   float ABSDEV ( int N, float SX[ ], int INCX, float SA )

Arguments :   N       number of elements in array
              SX[ ]   floating point array
              INCX    array integer increment or step
              SA      floating point reference value

Description : ABSDEV returns the sum of absolute deviations of the elements of
              array SX from SA. This is useful for calculating robust statistical parameters. By using the
              DSP ifalt() function, this is well suited for the DSP.


C  code 

#include <math.h>   /* fabs() */

float ABSDEV ( int N, float SX[], int INCX,
               float SA )

{ register int i, n;
  static float sum;

  sum = 0.0;   /* initial value */

  for (i=n=0; n<N; i+=INCX, n++)
      sum += fabs( SX[i] - SA );

  /* sum abs value of all elements */
  return( sum );

} /**** End ABSDEV ****/


DSP  code 

ABSDEV:     *r14++r19 = a2 = a2
            *r14++r19 = a3 = a3
            nop
            r14 = r14 - 24
            a3 = *r14++r19          /* SA */
            r15 = *r14++r19         /* INCX */
            r4 = *r14++r19          /* SX[] */
            r3 = *r14++r19          /* N */
            r1 = A_ZERO
            r15 = r15 * 2
            r15 = r15 * 2           /* float INCX */
            r3 = r3 - 2             /* adjust N */
            a0 = *r1                /* initial factor */
absdevi:    a1 = -a3 + *r4++r15
            a2 = -a1
            a2 = ifalt(a1)
valpos:     if (r3-- >= 0) goto absdevi
            a0 = a2 + a0            /* summing term */
            a2 = *r14++r19
            a3 = *r14++r19
            return (r18)
            r14 = r14 - 8           /**** End ABSDEV ****/
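As a usage sketch, ABSDEV can serve as the building block for a mean absolute deviation. The helper mean_abs_dev below is hypothetical and not part of the prototyped library. ABSDEV is restated here with a conditional expression in place of fabs() and a local rather than static accumulator, so that the fragment stands alone without the math library.

```c
#include <assert.h>

/* ABSDEV restated for a self-contained example: sum of absolute
   deviations of the strided elements of SX from SA. */
float ABSDEV(int N, float SX[], int INCX, float SA)
{
    int i, n;
    float d, sum = 0.0f;

    for (i = n = 0; n < N; i += INCX, n++) {
        d = SX[i] - SA;
        sum += (d < 0.0f) ? -d : d;
    }
    return sum;
}

/* Hypothetical helper: mean absolute deviation about the mean. */
float mean_abs_dev(int N, float SX[])
{
    float mean = 0.0f;
    int n;

    assert(N > 0);
    for (n = 0; n < N; n++)
        mean += SX[n];
    mean /= (float)N;

    return ABSDEV(N, SX, 1, mean) / (float)N;
}
```

Passing the median instead of the mean as SA gives the corresponding robust measure about the median.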


ADDCPY

Prototype :   void ADDCPY ( int N, float SX[ ], int INCX, float SY[ ], int INCY, float FACTOR )

Arguments :   N        number of elements in array
              SX[ ]    floating point x array input
              INCX     integer increment for x
              SY[ ]    floating point y array output
              INCY     integer increment for y
              FACTOR   floating point scalar

              $y_i = x_i + b$, $i = 0, \ldots, N-1$ (in vector form, $y = x + b$)

Description : ADDCPY adds a floating point scalar, FACTOR, to the elements of an array. The modified
              values are returned in a separate array. This is useful for centering around means, etc. Array
              multiplications make this suited for the DSP.
C code

void ADDCPY ( int N, float SX[], int INCX,
              float SY[], int INCY, float FACTOR )
{   register int i, j, n;

    for (i=j=n=0; n<N; i+=INCX, j+=INCY, n++)
        SY[j] = SX[i] + FACTOR;
                /* add factor to each element */
}   /**** End ADDCPY ****/

DSP code

ADDCPY:    r14 = r14 - 24
           a1 = *r14++r19       /* Factor adding */
           r17 = *r14++r19      /* INCY */
           r1 = *r14++r19       /* SY */
           r16 = *r14++r19      /* INCX */
           r3 = *r14++r19       /* SX */
           r15 = *r14++r19      /* N */
           r17 = r17 * 2
           r17 = r17 * 2
           r16 = r16 * 2
           r16 = r16 * 2
           r15 = r15 - 2
addcpyl:   if (r15-- >= 0) goto addcpyl
           *r1++r17 = a0 = a1 + *r3++r16
           return (r18)
           nop                  /** End ADDCPY **/


ADDSCAL

Prototype :    void ADDSCAL ( int N, float SY[ ], int INCY, float A, float B )

Arguments :    N       number of elements in array
               SY[ ]   input floating point y array
               INCY    integer increment for y array
               A       floating point multiplier
               B       floating point translation

Description :  ADDSCAL scales and translates a vector. This is useful for doing an in-place linear
               transformation. Multiply and accumulate in one instruction makes this well suited for the
               DSP.

C code

void ADDSCAL ( int N, float SY[], int INCY,
               float A, float B )
{   register int i, n;

    for (i=n=0; n<N; i+=INCY, n++)
        SY[i] = A * SY[i] + B;
                /* linear transformation on all elements */
}   /***** End ADDSCAL ****/

DSP code

ADDSCAL:   *r14++r19 = a2 = a2
           nop
           r14 = r14 - 24       /* (1+5)*4 */
           a2 = *r14++r19       /* Translation */
           a1 = *r14++r19       /* Multiplier */
           r17 = *r14++r19      /* INCY */
           r1 = *r14++r19       /* SY */
           r15 = *r14++r19      /* N */
           r17 = r17 * 2
           r17 = r17 * 2
           r15 = r15 - 2
addscall:  if (r15-- >= 0) goto addscall
           *r1++r17 = a0 = a2 + a1 * *r1
           a2 = *r14
           return (r18)
           nop                  /***** End ADDSCAL ****/
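
As an illustration of the in-place linear transformation that ADDSCAL performs, the following sketch rescales data from a known range onto [0, 1]. It is a portable restatement, not the DSP routine itself; the lowercase names addscal and rescale_unit are hypothetical and are introduced only for this example.

```c
#include <assert.h>

/* Portable restatement of the ADDSCAL loop: sy[i] = a * sy[i] + b. */
static void addscal(int n, float sy[], int incy, float a, float b)
{
    int i, k;
    for (i = k = 0; k < n; i += incy, k++)
        sy[i] = a * sy[i] + b;
}

/* Hypothetical helper: map values from [lo, hi] onto [0, 1] in place. */
static void rescale_unit(int n, float sy[], float lo, float hi)
{
    float a;
    assert(hi > lo);            /* a zero-width range cannot be rescaled */
    a = 1.0f / (hi - lo);       /* multiplier A */
    addscal(n, sy, 1, a, -lo * a);  /* translation B = -lo * A */
}
```

Folding the subtraction of lo into the translation B lets the whole mapping run as one multiply-accumulate per element, which is the property the description relies on.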


ADDSCALCPY

Prototype :    void ADDSCALCPY ( int N, float SX[ ], int INCX, float SY[ ], int INCY,
                                 float A, float B )

Arguments :    N       number of elements in array
               SX[ ]   input floating point x array
               INCX    integer increment for x array
               SY[ ]   output floating point y array
               INCY    integer increment for y array
               A       floating point scaling factor
               B       floating point translation

Description :  ADDSCALCPY scales a vector and adds a translation before copying to y. This is the basis
               for making a linear transformation. Multiply and accumulate in one instruction makes this well
               suited for the DSP.

C code

void ADDSCALCPY ( int N, float SX[], int INCX,
                  float SY[], int INCY, float A, float B )
{   register int i, j, n;

    for (i=j=n=0; n<N; i+=INCX, j+=INCY, n++)
        SY[j] = A * SX[i] + B;
                /* linear transformation on all elements */
}   /**** End ADDSCALCPY ****/

DSP code

ADDSCALCPY:  *r14++r19 = a2 = a2
             nop
             r14 = r14 - 32     /* (7+1)*4 */
             a2 = *r14++r19     /* Translation */
             a1 = *r14++r19     /* Multiplier */
             r17 = *r14++r19    /* INCY */
             r1 = *r14++r19     /* SY */
             r16 = *r14++r19    /* INCX */
             r3 = *r14++r19     /* SX */
             r15 = *r14++r19    /* N */
             r17 = r17 * 2
             r17 = r17 * 2
             r16 = r16 * 2
             r16 = r16 * 2
             r15 = r15 - 2
asccpyl:     if (r15-- >= 0) goto asccpyl
             *r1++r17 = a0 = a2 + a1 * *r3++r16
             a2 = *r14
             return (r18)
             nop                /** End ADDSCALCPY **/


ADDVEC

Prototype :    void ADDVEC ( int N, float SX[ ], float SY[ ], float SZ[ ] )

Arguments :    N       number of elements in array
               SX[ ]   floating point x array input
               SY[ ]   floating point y array input
               SZ[ ]   floating point output array

Description :  ADDVEC adds two floating point vectors. The modified values are returned in a separate
               array. The array addition is well suited for DSP operation.

               z_i = x_i + y_i ;  i = 1, ..., N

C code

void ADDVEC ( int N, float SX[], float SY[],
              float SZ[] )
{   register int n;

    for (n=0; n<N; n++)
        SZ[n] = SX[n] + SY[n];
}   /***** End ADDVEC ****/

DSP code

ADDVEC:    r14 = r14 - 16
           r1 = *r14++r19       /* output array SZ[] */
           r2 = *r14++r19       /* input array SY[] */
           r3 = *r14++r19       /* input array SX[] */
           r4 = *r14++r19       /* # of elements N */
           nop
           r4 = r4 - 2
addvecl:   if (r4-- >= 0) goto addvecl
           *r1++ = a0 = *r3++ + *r2++
           return (r18)
           nop                  /*** End ADDVEC ***/

CDF

Prototype :    void CDF ( int N, float SX[ ], int INCX, float SY[ ], int INCY )

Arguments :    N       number of elements in array
               SX[ ]   input floating point x array
               INCX    integer increment for x array
               SY[ ]   output floating point y array
               INCY    integer increment for y array

Description :  CDF computes the running sum or cumulative distribution
               function for an input array. The array addition makes this
               well suited for DSP operation. For long arrays it may be wise to center the data beforehand in
               order to reduce roundoff errors.

               y_j = \sum_{i=1}^{j} x_i ;  j = 1, ..., N

C code

void CDF ( int N, float SX[], int INCX,
           float SY[], int INCY )
{   register int i, j, n;
    static float accum;

    accum = 0.0;                /* initial sum */
    for (i=j=n=0; n<N; i+=INCX, j+=INCY, n++)
        SY[j] = accum += SX[i];
                /* accumulate sum of SX */
}   /***** End CDF ****/

DSP code

CDF:       r14 = r14 - 20
           r17 = *r14++r19      /* INCY */
           r1 = *r14++r19       /* SY */
           r16 = *r14++r19      /* INCX */
           r3 = *r14++r19       /* SX */
           r15 = *r14++r19      /* N */
           r2 = A_ZERO
           r16 = r16 * 2
           r16 = r16 * 2
           r17 = r17 * 2
           r17 = r17 * 2
           r15 = r15 - 2
           a0 = *r2
cdfl:      if (r15-- >= 0) goto cdfl
           *r1++r17 = a0 = a0 + *r3++r16
           return (r18)
           nop                  /** End CDF **/
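
One way the running sum might be used is to turn histogram counts into an empirical cumulative distribution that ends at 1. The sketch below is a portable restatement under that assumption; the names cdf and counts_to_cdf are hypothetical, and the final normalisation step is not part of the CDF routine itself.

```c
#include <assert.h>

/* Portable restatement of the CDF loop: sy[j] = sx[0] + ... + sx[j]. */
static void cdf(int n, const float sx[], int incx, float sy[], int incy)
{
    int i, j, k;
    float accum = 0.0f;
    for (i = j = k = 0; k < n; i += incx, j += incy, k++) {
        accum += sx[i];
        sy[j] = accum;
    }
}

/* Hypothetical helper: convert bin counts to an empirical CDF ending at 1. */
static void counts_to_cdf(int n, const float counts[], float out[])
{
    int k;
    float total;
    assert(n > 0);
    cdf(n, counts, 1, out, 1);
    total = out[n - 1];         /* last running sum is the grand total */
    assert(total > 0.0f);       /* at least one observation is required */
    for (k = 0; k < n; k++)
        out[k] /= total;
}
```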


CENTER

Prototype :    void CENTER ( int N, float SY[ ], int INCY, float FACTOR )

Arguments :    N       number of elements in array
               SY[ ]   floating point y array input
               INCY    integer increment for y
               FACTOR  floating point scalar

Description :  CENTER adds a floating point scalar, FACTOR, to the elements of an array. The modified
               values are returned in the same array. This is useful for centering around means, etc. The array
               operation makes this well suited for DSP operation.

               y_i = y_i + b ;  i = 1, ..., N     or     y = y + b

C code

void CENTER ( int N, float SX[], int INCX,
              float FACTOR )
{   register int i, n;

    for (i=n=0; n<N; i+=INCX, n++)
        SX[i] += FACTOR;
                /* add factor to each element */
}   /***** End CENTER ****/

DSP code

CENTER:    r14 = r14 - 16
           a1 = *r14++r19       /* Factor adding */
           r16 = *r14++r19      /* INCX */
           r3 = *r14++r19       /* SX */
           r15 = *r14++r19      /* N */
           r16 = r16 * 2
           r16 = r16 * 2
           r15 = r15 - 2
centl:     if (r15-- >= 0) goto centl
           *r3++r16 = a0 = a1 + *r3
           return (r18)
           nop                  /*** End CENTER ***/
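
Centering around the mean, as mentioned in the description, might combine a plain summation with CENTER by passing the negated mean as FACTOR. The sketch below assumes unit stride; csum, center and center_on_mean are hypothetical lowercase stand-ins, not the library routines.

```c
#include <assert.h>

/* Sum of n strided elements (stand-in for a summation routine). */
static float csum(int n, const float sx[], int incx)
{
    int i, k;
    float sum = 0.0f;
    for (i = k = 0; k < n; i += incx, k++)
        sum += sx[i];
    return sum;
}

/* Portable restatement of the CENTER loop: sy[i] += factor. */
static void center(int n, float sy[], int incy, float factor)
{
    int i, k;
    for (i = k = 0; k < n; i += incy, k++)
        sy[i] += factor;
}

/* Hypothetical helper: subtract the sample mean in place and return it. */
static float center_on_mean(int n, float sy[])
{
    float mean;
    assert(n > 0);
    mean = csum(n, sy, 1) / (float)n;
    center(n, sy, 1, -mean);    /* FACTOR = -mean */
    return mean;
}
```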
CSUM

Prototype :    float CSUM ( int N, float SX[ ], int INCX )

Arguments :    N       number of elements in array
               SX[ ]   floating point array
               INCX    integer increment for array

Description :  CSUM calculates the cumulative sum of N elements in an array. The array
               summation makes this well suited for DSP operation. To reduce roundoff
               errors, it may be wise to center the data beforehand.

C code

float CSUM ( int N, float SX[], int INCX )
{   register int i, n;
    static float sum;

    sum = 0.0;                  /* initial sum */
    for (i=n=0; n<N; i+=INCX, n++)
        sum += SX[i];           /* sum all elements */
    return( sum );
}   /***** End CSUM ****/

DSP code

CSUM:      r14 = r14 - 12
           r16 = *r14++r19      /* INCX */
           r3 = *r14++r19       /* SX */
           r17 = *r14++r19      /* N */
           r2 = A_ZERO
           r17 = r17 - 2
           r16 = r16 * 2
           r16 = r16 * 2
           a0 = *r2
csuml:     if (r17-- >= 0) goto csuml
           a0 = a0 + *r3++r16
           return (r18)
           nop                  /** End CSUM **/


CSUMSQ

Prototype :    float CSUMSQ ( int N, float SX[ ], int INCX, float SY[ ], int INCY )

Arguments :    N       number of elements in array
               SX[ ]   floating point array x
               INCX    integer increment for array x
               SY[ ]   floating point array y
               INCY    integer increment for array y

Description :  CSUMSQ calculates the cumulative sum of a product of a value squared and another value. If
               either INCY or INCX is set to zero then it becomes a product of an array and a fixed value.
               This is suitable for calculating density function variances. SDOT can be used to calculate
               means in a similar manner. Array multiplication and addition makes this well suited for DSP
               operation.

C code

float CSUMSQ ( int N, float SX[], int INCX,
               float SY[], int INCY )
{   register int i, j, n;
    static float sum;

    sum = 0.0;                  /* initial value */
    for (i=j=n=0; n<N; i+=INCX, j+=INCY, n++)
        sum += SX[i] * SX[i] * SY[j];
                /* calculate sqr(X)*Y */
    return( sum );
}   /***** End CSUMSQ ****/

DSP code

CSUMSQ:    r14 = r14 - 20
           r15 = *r14++r19      /* INCY */
           r2 = *r14++r19       /* input array SY[] */
           r16 = *r14++r19      /* INCX */
           r3 = *r14++r19       /* input array SX[] */
           r4 = *r14++r19       /* # of elements N */
           r1 = A_ZERO
           r15 = r15 * 2
           r15 = r15 * 2
           r16 = r16 * 2
           r16 = r16 * 2
           r4 = r4 - 2
           a0 = *r1
csumsql:   a1 = *r3 * *r2++r15
           nop
           if (r4-- >= 0) goto csumsql
           a0 = a0 + a1 * *r3++r16
           return (r18)
           nop                  /** End CSUMSQ **/
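
The variance calculation mentioned in the description can be sketched as the second moment from CSUMSQ minus the squared mean from a dot product. The code below is only an illustration: it assumes the density values p sum to 1 over the grid x, and the names sdot, csumsq and density_variance are hypothetical lowercase stand-ins.

```c
#include <assert.h>

/* Dot product of two strided vectors (stand-in for SDOT). */
static float sdot(int n, const float sx[], int incx,
                  const float sy[], int incy)
{
    int i, j, k;
    float sum = 0.0f;
    for (i = j = k = 0; k < n; i += incx, j += incy, k++)
        sum += sx[i] * sy[j];
    return sum;
}

/* Portable restatement of CSUMSQ: sum of sx[i]^2 * sy[j]. */
static float csumsq(int n, const float sx[], int incx,
                    const float sy[], int incy)
{
    int i, j, k;
    float sum = 0.0f;
    for (i = j = k = 0; k < n; i += incx, j += incy, k++)
        sum += sx[i] * sx[i] * sy[j];
    return sum;
}

/* Hypothetical helper: variance of a discrete density p on grid x. */
static float density_variance(int n, const float x[], const float p[])
{
    float mean, second;
    assert(n > 0);
    mean = sdot(n, x, 1, p, 1);      /* E[x]   */
    second = csumsq(n, x, 1, p, 1);  /* E[x^2] */
    return second - mean * mean;
}
```

Subtracting the squared mean can lose precision when the mean is large relative to the spread, which is one reason centering the data first may help.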


DIST

Prototype :    float DIST ( int N, float SX[ ], int INCX, float SY[ ], int INCY )

Arguments :    N       dimension or number of elements in array
               SX[ ]   floating point array x
               INCX    integer increment for array x
               SY[ ]   floating point array y
               INCY    integer increment for array y

Description :  DIST calculates the square of the Euclidean distance between two vectors. This is well suited
               for the DSP when the dimension of the vectors (N) becomes much larger than 2.

C code

float DIST ( int N, float SX[], int INCX,
             float SY[], int INCY )
{   register int i, j, n;
    static float out, temp;

    out = 0.0;
    if (N<0)
        return(0.0);
    for (i=j=n=0; n<N; i+=INCX, j+=INCY, n++)
    {
        temp = SX[i] - SY[j];
        out += temp * temp;
    }
    return( out );
}   /***** End DIST ****/

DSP code

DIST:      r14 = r14 - 20
           r16 = *r14++r19      /* INCX */
           r1 = *r14++r19       /* SX */
           r17 = *r14++r19      /* INCY */
           r2 = *r14++r19       /* SY */
           r3 = *r14++r19       /* N */
           r4 = A_ZERO
           r17 = r17 * 2
           r17 = r17 * 2        /* inc y */
           r16 = r16 * 2
           r16 = r16 * 2        /* inc x */
           a0 = *r4             /* load zero */
           r3 = r3 - 2          /* N - 2 counter */
distl:     a1 = *r1++r16 - *r2++r17
           nop
           if (r3-- >= 0) goto distl
           a0 = a0 + a1 * a1
           return (r18)
           nop                  /*** End DIST ***/
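
Because DIST returns the squared distance, it can be compared directly when only the ordering of distances matters, for example in a nearest-point search. The sketch below assumes the candidate points are stored row-major as an npts x dim array; dist2 and nearest_row are hypothetical names used only here.

```c
#include <assert.h>

/* Portable restatement of DIST: squared Euclidean distance. */
static float dist2(int n, const float sx[], int incx,
                   const float sy[], int incy)
{
    int i, j, k;
    float out = 0.0f, temp;
    for (i = j = k = 0; k < n; i += incx, j += incy, k++) {
        temp = sx[i] - sy[j];
        out += temp * temp;
    }
    return out;
}

/* Hypothetical helper: index of the row of pts closest to the query q. */
static int nearest_row(int npts, int dim, const float pts[], const float q[])
{
    int r, best = 0;
    float d, dbest;
    assert(npts > 0);
    dbest = dist2(dim, pts, 1, q, 1);
    for (r = 1; r < npts; r++) {
        d = dist2(dim, pts + r * dim, 1, q, 1);
        if (d < dbest) {        /* no square root needed for comparison */
            dbest = d;
            best = r;
        }
    }
    return best;
}
```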
EXPSM

Prototype :    void EXPSM ( int N, float SX[ ], int INCX, float SY[ ], int INCY, float ALPHA )

Arguments :    N       number of elements in array
               SX[ ]   input floating point x array
               INCX    integer increment for x array
               SY[ ]   output floating point y array
               INCY    integer increment for y array
               ALPHA   smoothing parameter

Description :  EXPSM filters an input array using an exponential smoothing algorithm. Array multiplication
               and addition makes this well suited for DSP operation. Other filtering operations such as IIR
               and FIR are available in the AT&T DSP library.

C code

void EXPSM ( int N, float SX[], int INCX,
             float SY[], int INCY, float ALPHA )
{   register int i, j, n;
    static float temp;

    SY[0] = SX[0];              /* initial starting values */
    for (i=INCX, j=INCY, n=1; n<N; i+=INCX, j+=INCY, n++)
    {   temp = SY[j-INCY] * ( 1 - ALPHA );
        SY[j] = temp + ALPHA * SX[i];
                /* (1 - alpha)*SY + alpha*SX */
    }
}   /***** End EXPSM ****/

DSP code

EXPSM:     *r14++r19 = a2 = a2
           nop
           r14 = r14 - 28       /* (6+1)*4 */
           a2 = *r14++r19       /* ALPHA */
           r17 = *r14++r19      /* INCY */
           r1 = *r14++r19       /* SY */
           r16 = *r14++r19      /* INCX */
           r3 = *r14++r19       /* SX */
           r15 = *r14++r19      /* N */
           r4 = A_ONE
           a1 = -a2 + *r4       /* load 1-ALPHA */
           r16 = r16 * 2
           r16 = r16 * 2
           r15 = r15 - 2
           if (mi) goto expsme
           *r1 = a0 = *r3++r16  /* SY(0) = SX(0) */
           r15 = r15 - 1
           r17 = r17 * 2
           r17 = r17 * 2
           r4 = r1
           r4 = r4 + r17
expsml:    a0 = a1 *
           if (r15-- >= 0) goto expsml
           *r4++r17 = a0 = a0 + a2 * *r3++r16
expsme:    a2 = *r14
           return (r18)
           nop                  /*** End EXPSM ***/
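
The smoothing recurrence used by EXPSM, y_0 = x_0 and y_j = (1 - ALPHA) y_(j-1) + ALPHA x_j, can be checked by hand on a short series. The sketch below is a portable restatement; the lowercase name expsm and the range check on alpha are assumptions made for this example.

```c
#include <assert.h>

/* Exponential smoothing: sy[0] = sx[0], then
   sy[j] = (1 - alpha) * sy[j-1] + alpha * sx[j]. */
static void expsm(int n, const float sx[], int incx,
                  float sy[], int incy, float alpha)
{
    int i, j, k;
    assert(alpha >= 0.0f && alpha <= 1.0f);  /* assumed valid range */
    if (n <= 0)
        return;
    sy[0] = sx[0];              /* starting value */
    for (i = incx, j = incy, k = 1; k < n; i += incx, j += incy, k++)
        sy[j] = (1.0f - alpha) * sy[j - incy] + alpha * sx[i];
}
```

With ALPHA = 0.5 the input series 4, 0, 2 smooths to 4, 2, 2: each output is the average of the previous output and the current input.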


FILL

Prototype :    void FILL ( int N, float SX[ ], int INCX, float START, float STEP )

Arguments :    N       number of elements in array
               SX[ ]   floating point x array to be filled
               INCX    integer increment for x array
               START   starting value for fill
               STEP    step value for fill

Description :  FILL creates an array of floating point values based on a
               starting value and a step size. This can be used to zero an array, etc. The use of accumulators
               for incrementing makes this well suited for DSP operation.

C code

void FILL ( int N, float SX[], int INCX,
            float START, float STEP )
{   register int i, n;

    SX[0] = START;              /* store starting value */
    for (i=INCX, n=1; n<N; i+=INCX, n++)
        SX[i] = SX[i-INCX] + STEP;
                /* store each value 1 step apart */
}   /***** End FILL ****/

DSP code

FILL:      r14 = r14 - 20
           a1 = *r14++r19       /* STEP */
           a0 = *r14++r19       /* START */
           r15 = *r14++r19      /* INCX */
           r3 = *r14++r19       /* output array SX[] */
           r4 = *r14++r19       /* # elements N */
           r15 = r15 * 2
           r15 = r15 * 2
           r4 = r4 - 2
           if (mi) goto fille
           *r3++r15 = a0 = a0
           r4 = r4 - 1
filll:     if (r4-- >= 0) goto filll
           *r3++r15 = a0 = a0 + a1
fille:     return (r18)
           nop                  /** End FILL **/
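
Two typical uses of FILL, building an evenly spaced grid and zeroing an array, can be sketched as below. This is a portable restatement under the name fill, which is hypothetical; like the listing above it accumulates the step, so rounding error can build up on long arrays when STEP is not exactly representable.

```c
#include <assert.h>

/* Portable restatement of FILL: sx[k*incx] = start + k*step,
   formed by repeated addition as in the original routine. */
static void fill(int n, float sx[], int incx, float start, float step)
{
    int i, k;
    float value = start;
    assert(n >= 0);
    for (i = k = 0; k < n; i += incx, k++) {
        sx[i] = value;
        value += step;          /* accumulator carries the running value */
    }
}
```

Calling fill(n, x, 1, 0.0f, 0.0f) zeroes an array, while a stride of 2 fills every other element and leaves the rest untouched.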


FLOATA

Prototype :    void FLOATA ( int N, int X[ ], int INCX, float SY[ ], int INCY )

Arguments :    N       number of elements in array
               X[ ]    input integer x array
               INCX    integer increment for x array
               SY[ ]   output floating point y array
               INCY    integer increment for y array

Description :  FLOATA converts an array of integer values to an array of floating point numbers. For each
               array element, a single DSP instruction is needed to convert from integer to float.

               y_i = (float) x_i ;  i = 1, ..., N

C code

void FLOATA ( int N, int X[], int INCX,
              float SY[], int INCY )
{   register int i, j, n;

    for (i=j=n=0; n<N; i+=INCX, j+=INCY, n++)
        SY[j] = (float) X[i];
                /* cast integer array to float */
}   /***** End FLOATA ****/

DSP code

FLOATA:    r14 = r14 - 20
           r17 = *r14++r19      /* INCY */
           r1 = *r14++r19       /* SY */
           r16 = *r14++r19      /* INCX */
           r3 = *r14++r19       /* X */
           r15 = *r14++r19      /* N */
           r17 = r17 * 2
           r17 = r17 * 2
           r16 = r16 * 2
           r15 = r15 - 2
floatal:   if (r15-- >= 0) goto floatal
           *r1++r17 = a0 = float(*r3++r16)
           return (r18)
           nop                  /*** End FLOATA ***/


HEAP

Prototype :    void HEAP ( int N, float SX[ ], int INCX )

Arguments :    N       number of elements in array
               SX[ ]   input floating point x array
               INCX    integer increment for x array

Description :  HEAP does an in-place heap sort in ascending order
               on the floating point array SX. This routine is not
               optimal for the DSP because of the frequent use of
               integer operations. However, the low level optimization increases the speed by 30% over the
               DSP C-compiled version.

               x = sort(x),  where x_1 <= x_2 <= ... <= x_{N-1} <= x_N

C code

void HEAP ( int N, float SX[], int INCX )
{   register int i, ir, j, l;       /* indices */
    static float temp;              /* temp storage */

    l  = (N / 2) * INCX;
    ir = (N - 1) * INCX;            /* right end of array */
    for (;;)
    {   if ( l > 0 )                /* still hiring */
        {   l -= INCX;
            temp = SX[l];
        }
        else                        /* promotion retirement phase */
        {   temp = SX[ir];          /* retire top of heap */
            SX[ir] = SX[0];         /* to end of array */
            ir -= INCX;
            if ( ir == 0 )
            {   SX[0] = temp;       /* done with sort */
                return;
            }
        }
        i = l;
        j = (l << 1) + INCX;
                /* sift down temp to proper position */
        while ( j <= ir )
        {   if ( j < ir && SX[j] < SX[j+INCX] )
                j += INCX;
            if ( temp < SX[j] )     /* demote temp */
            {   SX[i] = SX[j];
                i = j;
                j += j + INCX;
            }
            else                    /* temp's proper position */
                j = ir + INCX;
        }
        SX[i] = temp;               /* place temp */
    }
}   /**** End HEAP ****/


DSP code

idafina  ragi  rl 
Idafina  ragir  r2 
Idafina  ragj  r3 
Idafina  ragi  r4 
Idafina  rbasa  r5 
Idafina  ragn 
Idafina  arra  aO 
Idafina  ragi  r7 
Idafina  ra^  r6 
Idafina  rinc  r16 
Idafina  roopy  r17 

HEAP:  ^14^19  -  r5 

•rl4«rl9  -  r€ 

•r14-Mr19  ■  r7 
•rl4^rl9  -  r€ 

-  rl4  -  28  /•  (4+3)*4  •/ 

rinc  -  •r14^r19  /•  INCX  •/ 
rbasa  -  •rl4Mr19  /•  ARRAY  •/ 
ragn  -  "rlA^IR  /•  NUMBER  •/ 
r14  -  rU  4.  16 
haap:  rinc  ■  rlnc*2 

rinc  ■  r1nc*2 
•r14  •  rinc 
a1  -  f1oat(  VU  ) 

/•  a1  holds  floating  INCX  •/ 

ragi  ■  ragn/2 

V14  -  ragi 

aO  -  f1oat(  'rU  ) 

/•  aO  holds  floating  N/2  •/ 
rcopy  ■  rbasa 
nop 

aO  -  aO  •  a1 

•r14  >  aO  >  Int(aO) 
ragir  ■  ragn  -  1 
nop 
nop 

ragi  »  VIA 
nop 

ragi  ■  ragi  *  rbasa 

/•  ragi  points  to  cantar  alamant  •/ 

•r14  ■  ragir 

aO  »  f1oat(  "rlA  ) 

nop 

nop 

aO  >  aO  *  a1 

•r14  ■  aO  ■  Int(aO) 

nop 

nop 

nop 

ragir  ■  •rIA 
nop 

ragir  •  ragir  ♦  rbasa 
rl4  -  r14  -  16 
L10:  ragi  -  rbasa 

1f(aq)  goto  L11 
arra  ■  ’ragir 
ragi  »  ragi  -  r1nc 
arra  ■  ’ragi 
goto  LI  3 
ragi  •  ragi 
L11:  nop 

•ragir  ■  al  »  •rbasa 
nop

r«g1r  •  rag1r  -  r1nc 
r«g1r  -  rbas* 

1f(ne)  goto  L13 
ragi  «  ragi 
*rt>asa  >  a1  >  arra 
r5  •  •r14++rl9 

r«  . 

r7  »  •r14++rl9 

re  -  •r14++rl9 

ratum  (r18) 
r14  >  r14  -  16 
L13:  ragj  ■  rag1 

rag2  *  ragi 
rag2  «  rag2  -  rcopy 
ragJ  •  ragJ  ♦  reg2 
ragJ  »  ragj  +  rinc 
L14:  ragj  -  ragir 

1^(9t)  goto  L35 
nop 

ragj  -  ragir 
lf(ga)  goto  L30 
nop 

rag2  «  ragj 

rag2  »  rag2  +  rinc 

a1  «  ‘ragj  -  •rag2 

nop 

nop 

nop 

If(aga)  goto  L30 
nop 

ragj  »  ragj  ♦  rinc 
L30:  a1  »  arra  -  ‘ragj 

rag2  •  ragj 
rag2  ■  rag2  -  rcopy 
rag2  ■  rag2  +  rinc 
If(aga)  goto  L33 
nop 

•ragi  »  a1  «  “regj 
nop 

ragi  «  ragj 
goto  L14 


ragj  •  ragj  + 

rag2 

L33: 

ragj  »  ragir 

goto  L14 

ragj  »  ragj  + 

rinc 

L35; 

•ragi  =  a1  »  arra 

goto  L10 

nop  /•••  End  HEAP  •••/ 


HISTOG

Prototype :    void HISTOG ( int N, float SX[ ], int NPLOT, float *SCALE, float *UP, float *LOW,
                             float HIST[ ] )

               bin = Round( SCALE (x - LOW) ),
               Hist(bin) = Hist(bin) + 1.0

Arguments :    N        Number of elements to bin
               SX[ ]    Array of floating point elements to bin
               NPLOT    Number of bins in histogram
               *SCALE   Pointer to histogram scaling factor
               *UP      Pointer to maximum bin value
               *LOW     Pointer to minimum bin value
               HIST[ ]  Histogram floating point array

Description :  HISTOG bins incoming values according to their floating point magnitude. If NPLOT is 0, all
               values are placed in the same histogram array, HIST. If value is greater than UP, the
               maximum bin value is incremented. If value is less than LOW, the minimum bin value is
               incremented. This routine is not optimal for the DSP because of the nops introduced when
               binning the parameters. Compiling directly from C code and using other low-level routines
               will increase flexibility.

C code

void HISTOG ( int N, float SX[], int NPLOT,
              float *SCALE, float *UP, float *LOW, float *HIST )
{   register int i, bin, ptr;

    for (i=0, ptr=0; i<N; i++, ptr+=NPLOT)
                /* ptr = i * NPLOT */
    {   bin = (int) ( *SCALE *
              ( LIMIT( *LOW, *UP, SX[i] ) - *LOW ));
                /* calculate bin */
        HIST[ptr + bin] += 1.0;
    }
}   /***** End HISTOG ****/


DSP code

HISTOG:    *r14++r19 = r5
           *r14++r19 = r7
           *r14++r19 = r9
           *r14++r19 = r11
           *r14++r19 = r12
           *r14++r19 = a3 = a3
           nop
           r14 = r14 - 52       /* (6+7)*4 */
           r3 = *r14++r19       /* HIST array */
           r5 = *r14++r19       /* LOW limit */
           r11 = *r14++r19      /* UPPER limit */
           r12 = *r14++r19      /* SCALE */
           r17 = *r14++r19      /* NPLOT intervals */
           r1 = *r14++r19       /* INPUT array */
           r16 = *r14++r19      /* N data */
           r15 = r3             /* copy of histogram */
           r9 = A_TEMP
           r2 = A_ZERO
           r4 = A_ONE
           a3 = *r11 - *r5
                                /* upper - lower limit */
           r17 = r17 * 2
           r17 = r17 * 2
                                /* histogram block length */
           r16 = r16 - 2
           a3 = a3 * *r12
                                /* (upper - lower) * scale */
histl:     a0 = *r1 - *r5
                                /* prob value - lower limit */
           nop
           nop
           a0 = a0 * *r12       /* scale */
           a0 = ifalt(*r2)
                                /* if a0 < 0 then a0 = 0 */
           a1 = *r11 - *r1++
                                /* upper limit - prob value */
           a0 = ifalt(a3)
                                /* if a1 < 0, a0 = up - low */
           *r9 = a0 = int(a0)
                                /* histogram length */
           nop
           nop
           nop
           r7 = *r9             /* load number into register */
           r3 = r15
           r7 = r7 * 2
           r7 = r7 * 2          /* multiply by 4 */
           r3 = r3 + r7         /* offset histogram */
           *r3 = a0 = *r4 + *r3
                                /* increment histogram */
           if (r16-- >= 0) goto histl
           r15 = r15 + r17
                                /* new histogram location */
           r5 = *r14++r19
           r7 = *r14++r19
           r9 = *r14++r19
           r11 = *r14++r19
           r12 = *r14++r19
           a3 = *r14++r19
           return (r18)
           r14 = r14 - 24       /** End HISTOG **/
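
The binning rule can be illustrated for the single-histogram case (NPLOT = 0), where every value lands in the same array. The sketch below is an assumption-laden restatement: it truncates toward zero as the C listing does, passes the limits by value rather than by pointer, and uses the hypothetical names limit and histog1. The caller is assumed to choose scale so that scale * (up - low) is the index of the last bin.

```c
#include <assert.h>

/* Clamp v to the closed range [lo, hi]. */
static float limit(float lo, float hi, float v)
{
    if (v > hi) return hi;
    if (v < lo) return lo;
    return v;
}

/* Hypothetical single-histogram binning: hist must hold at least
   (int)(scale * (up - low)) + 1 entries. */
static void histog1(int n, const float sx[], float scale,
                    float low, float up, float hist[])
{
    int i, bin;
    assert(up > low);
    for (i = 0; i < n; i++) {
        bin = (int)(scale * (limit(low, up, sx[i]) - low));
        hist[bin] += 1.0f;      /* out-of-range values fall in end bins */
    }
}
```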


HORN

Prototype :    float HORN ( int N, float COEF[ ], float X )

Arguments :    N        number of coefficients in polynomial
               COEF[ ]  Horner's algorithm polynomial coefficient array
               X        floating point x value

Description :  HORN evaluates a polynomial expression according to Horner's
               algorithm. This algorithm has a vector dependence which
               introduces a nop in the DSP code and thereby increases the
               execution time by ~50% over no dependence.

C code

float HORN ( int N, float COEF[], float X )
{   register int i;
    static float horn;

    horn = COEF[0];
    for (i=1; i<N; i++)
        horn = COEF[i] + horn * X;
    return( horn );
}   /***** End HORN ****/

DSP code

HORN:      r14 = r14 - 12
           a1 = *r14++r19       /* X input */
           r3 = *r14++r19       /* COEF */
           r2 = *r14++r19       /* N */
           nop
           r2 = r2 - 2
           if (mi) goto horne
           a0 = *r3++
           r2 = r2 - 1
hornl:     nop
           if (r2-- >= 0) goto hornl
           a0 = *r3++ + a0 * a1
horne:     return (r18)
           nop                  /*** End HORN ***/
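
The C listing implies that COEF[0] holds the coefficient of the highest power, since each step multiplies the running value by X before adding the next coefficient. A short portable check of that ordering is sketched below; the lowercase name horn is hypothetical.

```c
#include <assert.h>

/* Horner evaluation with coef[0] as the highest-order coefficient:
   ((coef[0] * x + coef[1]) * x + ...) + coef[n-1]. */
static float horn(int n, const float coef[], float x)
{
    int i;
    float value;
    assert(n > 0);
    value = coef[0];
    for (i = 1; i < n; i++)
        value = coef[i] + value * x;  /* depends on the previous step */
    return value;
}
```

For 2x^2 - 3x + 1 the coefficient array is {2, -3, 1}; at x = 3 the steps are 2, then -3 + 2*3 = 3, then 1 + 3*3 = 10. Each step needs the previous result, which is the dependence the description refers to.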

INTA

Prototype :    void INTA ( int N, float SX[ ], int INCX, int Y[ ], int INCY )

Arguments :    N       number of elements in array
               SX[ ]   input floating point x array
               INCX    integer increment for x array
               Y[ ]    output integer y array
               INCY    integer increment for y array

Description :  INTA converts an array of floating point values to an array of integer numbers. This requires
               a single DSP operation per element.

C code

void INTA ( int N, float SX[], int INCX,
            int Y[], int INCY )
{   register int i, j, n;

    n = N * INCX;
    for (i=0, j=0; i<n; i+=INCX, j+=INCY)
                /* go through array */
        Y[j] = (int) SX[i];
                /* casting values as integers */
}   /***** End INTA ****/

DSP code

INTA:      r14 = r14 - 20
           r17 = *r14++r19      /* INCY */
           r1 = *r14++r19       /* Y */
           r16 = *r14++r19      /* INCX */
           r3 = *r14++r19       /* SX */
           r15 = *r14++r19      /* N */
           r17 = r17 * 2
           r16 = r16 * 2
           r16 = r16 * 2
           r15 = r15 - 2
intal:     if (r15-- >= 0) goto intal
           *r1++r17 = a0 = int(*r3++r16)
           return (r18)
           nop                  /** End INTA **/


ISAMAX

Prototype :    int ISAMAX ( int N, float SX[ ], int INCX )

Arguments :    N       number of elements in array
               SX[ ]   floating point array
               INCX    array integer increment or step

Description :  ISAMAX finds the maximum absolute value
               of an array and returns the index corresponding to that array. Adapted from BLAS library.
               This operation is more lengthy than other array operations but still well suited for DSP
               operation.

C code

int ISAMAX ( int N, float SX[], int INCX )
{   register int i, n, imax;
    static float smax;

    imax = 0;
    smax = fabs( SX[0] );
    for (i=INCX, n=1; n<N; i += INCX, n++)
        if ( fabs(SX[i]) > smax )
        {   imax = i;
            smax = fabs( SX[i] );
        }
    return( imax );
}   /**** End ISAMAX ****/

DSP code

ISAMAX:    r14 = r14 - 12
           r16 = *r14++r19      /* INCX */
           r2 = *r14++r19       /* SX */
           r3 = *r14++r19       /* N */
           r17 = r16 * 2
           r17 = r17 * 2        /* float inc x */
           a0 = -*r2            /* initial maximum */
           a0 = ifalt(*r2++r17)
           r3 = r3 - 2          /* N - 2 counter */
           if (mi) goto isamaxe
           r3 = r3 - 1
           r15 = 0              /* initial index */
           r1 = r15
isamaxl:   a1 = -*r2
           a1 = ifalt(*r2++r17)
           a1 = -a1 + a0
           r4 = r1              /* save old max */
           r15 = r15 + r16
           r1 = r15             /* store max */
           if (ale) goto newmax
           nop
           if (r3-- >= 0) goto isamaxl
           r1 = r4              /* store old max */
newmax:    if (r3-- >= 0) goto isamaxl
           a0 = -a1 + a0
isamaxe:   return (r18)
           nop                  /** End ISAMAX **/




LIMIT

Prototype :    float LIMIT ( float MIN, float MAX, float VALUE )

Arguments :    MIN     minimum clamping level
               MAX     maximum clamping level
               VALUE   input value x

Description :  LIMIT clamps the input value to the upper
               or lower limit if x does not fall within its
               range. This function has considerable
               subroutine call overhead but is optimized
               with respect to the DSP C-compiled version.

                      MIN,   x < MIN,
               y  =   MAX,   x > MAX,
                      x,     MIN <= x <= MAX.

C code

float LIMIT ( float MIN, float MAX,
              float VALUE )
{   if ( VALUE > MAX )
        return( MAX );          /* check upper limit */
    if ( VALUE < MIN )
        return( MIN );          /* check lower limit */
    return( VALUE );
}   /**** End LIMIT ****/

DSP code

LIMIT:     r14 = r14 - 12
           a0 = *r14++r19         /* start with VALUE */
           a1 = -a0 + *r14        /* compare with MAX */
           a0 = ifalt(*r14++r19)  /* switch with MAX */
           a1 = a0 - *r14         /* compare with MIN */
           a0 = ifalt(*r14++r19)  /* switch with MIN */
           return (r18)
           nop                    /*** End LIMIT ***/


MAC

Prototype :    void MAC ( int N, float SX[ ], int INCX, float SY[ ], int INCY, float SZ[ ], int INCZ,
                          float SA )

Arguments :    N       number of elements in array
               SA      floating point scale factor, a
               SX[ ]   floating point x array
               INCX    integer array increment for x array
               SY[ ]   floating point y array
               INCY    integer array increment for y array
               SZ[ ]   floating point z array (output)
               INCZ    integer array increment for z array

Description :  MAC (multiply-accumulate) is the elementary vector operation z = x + ay. This is a single
               DSP instruction per element so it is well suited for DSP operation.

               z = x + a y

C code

void MAC ( int N, float SX[], int INCX, float SY[],
           int INCY, float SZ[], int INCZ, float SA )
{   register int n, i, j, k;

    if (N < 0)
        return;
    for (i=j=k=n=0; n < N;
         i+=INCX, j+=INCY, k+=INCZ, n++)
        SZ[k] = SX[i] + SA * SY[j];
}   /***** End MAC ****/

DSP code

MAC:       r14 = r14 - 32
           a1 = *r14++r19       /* SA */
           r15 = *r14++r19      /* INC Z */
           r2 = *r14++r19       /* SZ */
           r17 = *r14++r19      /* INCY */
           r4 = *r14++r19       /* SY */
           r16 = *r14++r19      /* INCX */
           r3 = *r14++r19       /* SX */
           r1 = *r14++r19       /* N */
           r16 = r16 * 2
           r16 = r16 * 2        /* inc x */
           r15 = r15 * 2
           r15 = r15 * 2        /* inc z */
           r17 = r17 * 2
           r17 = r17 * 2        /* inc y */
           r1 = r1 - 2          /* N - 2 counter */
macl:      if (r1-- >= 0) goto macl
           *r2++r15 = a0 = *r3++r16 + a1 * *r4++r17
           return (r18)
           nop                  /*** End MAC ***/


MATMULT

Prototype :    void MATMULT ( int M, int N, int P, float MATA[ ], float MATB[ ], float MATC[ ] )

Arguments :    M        number of rows in matrix A
               N        number of columns in matrix A
                        and rows in matrix B
               P        number of columns in matrix B
               MATA[ ]  input matrix A
               MATB[ ]  input matrix B
               MATC[ ]  output matrix C

Description :  MATMULT multiplies two matrices together and returns the result in matrix C. Matrices are
               sent as one-dimensional arrays. Given that the matrices are arranged in row-major order, this
               operation is well-suited for DSP operation. Other variations of matrix multiply that are not
               included in this appendix are MATMULT1, MATMULT2, MATMATT, and MATTMAT.

               C = A x B,  where A \in R^{M x N}, B \in R^{N x P}, and C \in R^{M x P}

MATMULT1

Prototype :    void MATMULT1 ( int M, int N, float MATA[ ], float MATB[ ], float MATC[ ] )

               C = A x B^T

MATMULT2

Prototype :    void MATMULT2 ( int M, int N, float MATA[ ], float MATB[ ], float MATC[ ] )

               C = A^T x B

MATMATT

Prototype :    void MATMATT ( int N, int P, float MATA[ ], float MATC[ ] )

               C = A x A^T

MATTMAT

Prototype :    void MATTMAT ( int N, int P, float MATA[ ], float MATC[ ] )

               C = A^T x A
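
Since the variants are not listed in this appendix, the following is only an illustrative sketch of what MATTMAT (C = A^T x A) might compute. It assumes A is stored row-major as an N x P array, so that C is P x P; that layout is an assumption, and the lowercase name mattmat is hypothetical.

```c
#include <assert.h>

/* Sketch of C = A^T * A for a row-major n x p matrix a.
   c receives the p x p result, also row-major. */
static void mattmat(int n, int p, const float a[], float c[])
{
    int i, j, k;
    assert(n > 0 && p > 0);
    for (i = 0; i < p; i++)
        for (j = 0; j < p; j++) {
            float sum = 0.0f;
            for (k = 0; k < n; k++)     /* dot product of columns i and j */
                sum += a[k * p + i] * a[k * p + j];
            c[i * p + j] = sum;
        }
}
```

Because the inner loop walks down columns of a row-major array, both operands are read with stride p, which is the same access pattern MATMULT uses for matrix B.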
C code

void MATMULT ( int M, int N, int P,
               float MATA[], float MATB[], float MATC[] )
{   register int m, n, p;
    static float sum;

    for (m=0; m<M; m++)
        for (p=0; p<P; p++)
        {   sum = 0.0;
            for (n=0; n<N; n++)
                sum += MATA[m*N + n] * MATB[n*P + p];
            MATC[m*P + p] = sum;
        }
}   /***** End MATMULT ****/

DSP code

MATMULT:   *r14++r19 = r5       /* save user regs */
           *r14++r19 = r6
           *r14++r19 = r7
           *r14++r19 = r8
           *r14++r19 = r9
           *r14++r19 = r10
           *r14++r19 = r11
           nop
           r14 = r14 - 52       /* (7+6) * 4 */
           r6 = *r14++r19       /* address of C[0,0] */
           r5 = *r14++r19       /* address of B[0,0] */
           r4 = *r14++r19       /* address of A[0,0] */
           r3 = *r14++r19       /* P */
           r2 = *r14++r19       /* N */
           r1 = *r14++r19       /* M */
           r7 = r5              /* points to B11 */
           r8 = A_ZERO
           a1 = *r8
           r15 = r3 * 2
           r15 = r15 * 2        /* r15 = 4P */
           r1 = r1 - 2          /* loop counter for M */
matmulA:   r11 = r3 - 2         /* loop counter for P */
matmulB:   a0 = a1
           r8 = r4              /* points to Aik, init i=k=1 */
           r9 = r7              /* points to Bkj, init k=j=1 */
                                /* computes sum of Aik*Bkj */
           r10 = r2 - 3
                                /* loop counter for k or N */
matmulC:   if (r10-- >= 0) goto matmulC
           a0 = a0 + *r9++r15 * *r8++
                                /* k is the variable */
           *r6++ = a0 = a0 + *r9++ * *r8++
                                /* stores Cij */
           if (r11-- >= 0) goto matmulB
           r7 = r7 + 4          /* inc j and repeat */
           r4 = r8              /* inc i or start of next row */
           if (r1-- >= 0) goto matmulA
           r7 = r5
                                /* restore pointer to point to B11 */
           r5 = *r14++r19
           r6 = *r14++r19
           r7 = *r14++r19
           r8 = *r14++r19
           r9 = *r14++r19
           r10 = *r14++r19
           r11 = *r14++r19
           return (r18)
           r14 = r14 - 28       /*** End MATMULT ***/


MATVEC 


Prototype  :  void  MATVEC  (int  M,  int  N,  float  MATA[  ],  float  VECB[  ],  float  VECC[  ]) 

Arguments  :  M  number  of  rows  in  matrix 

N  number  of  columns  in  matrix 

MATA[  ]  floating  point  matrix 

VECB[  ]  input  floating  point  vector 

VECC[  ]  output  floating  point  vector 
Description : MATVEC multiplies an M x N matrix by a vector of length N. The matrix is stored as a
one-dimensional array. Given that the matrix is arranged in row-major order, this operation is
well suited for DSP operation.

DSP code


MATVEC:   r14 = r14 - 20
          r1 = *r14++r19              /* address of C[0] */
          r2 = *r14++r19              /* address of B[0] */
          r4 = *r14++r19              /* address of A[0,0] */
          r16 = *r14++r19             /* N */
          r15 = *r14++r19             /* M */
          r3 = A_ZERO
          a1 = *r3
          nop
          r15 = r15 - 2               /* loop counter for M */
          r3 = r2                     /* points to Bk, initially k=1 */
matvec1:  a0 = a1                     /* computes sum of Aik*Bk */
          r17 = r16 - 3               /* loop counter for k or N */
matvec2:  if (r17-- >= 0) goto matvec2
          a0 = a0 + *r3++ * *r4++     /* k is the variable */
          *r1++ = a0 = a0 + *r3++ * *r4++     /* stores Ci */
          if (r15-- >= 0) pcgoto matvec1
          r3 = r2                     /* points to Bk, initially k=1 */
          return (r18)
          nop                         /*** End MATVEC ***/


MAXA 


Prototype  :  float  MAXA  ( int  N,  float  SX[  ],  int  INCX  ) 

Arguments  :  N  number  of  elements  in  array 

SX[  ]  input  floating  point  x  array 

INCX  integer  increment  for  x  array 

Description : MAXA finds the maximum value in array x. This operation is well suited for DSP operation
through the use of the ifalt() instruction.

C code

float MAXA ( int N, float SX[], int INCX )
{   register int i, n;
    static float mx;

    mx = SX[0];
    for (i=INCX, n=1; n<N; i+=INCX, n++)      /* go through array */
        mx = QMAX( mx, SX[i] );               /* keep track of maximum */
    return( mx );
}   /***** End MAXA *****/

DSP code

MAXA:     r14 = r14 - 12
          r16 = *r14++r19             /* INCX */
          r4 = *r14++r19              /* SX */
          r17 = *r14++r19             /* N */
          r16 = r16 * 2
          r16 = r16 * 2
          a1 = *r4                    /* Starting value */
          a0 = *r4++r16
          r17 = r17 - 2
          if (mi) goto maxae
          r17 = r17 - 1
maxa1:    a1 = a0 - *r4
          if (r17-- >= 0) goto maxa1
          a0 = ifalt(*r4++r16)
maxae:    return (r18)
          nop                         /*** End MAXA ***/


MATVEC C code

void MATVEC ( int M, int N, float MATA[],
              float VECB[], float VECC[] )
{   register int m, n;
    static float sum;

    for (m=0; m<M; m++)
    {   sum = 0.0;
        for (n=0; n<N; n++)
            sum += MATA[m*N + n] * VECB[n];
        VECC[m] = sum;
    }
}   /***** End MATVEC *****/
MAXIND

Prototype :   int MAXIND ( int N, float SX[ ], int INCX )

Arguments :   N      number of elements in array
              SX[ ]  input floating point x array
              INCX   integer increment for x array

Description : MAXIND finds the index of the element of array x with the maximum value. This operation
is closely related to ISAMAX.

              i : x_i = \sup\{ x_j : j = 0, \ldots, N \}


C code

int MAXIND ( int N, float SX[], int INCX )
{   register int i, n;
    static int indx;

    indx = 0;
    for (i=INCX, n=1; n<N; i+=INCX, n++)      /* go through array */
        if ( SX[i] > SX[indx] )               /* keep track of index whose value is max */
            indx = i;
    return( indx );
}   /***** End MAXIND *****/

DSP code


MAXIND:     r14 = r14 - 12
            r16 = *r14++r19           /* INCX */
            r2 = *r14++r19            /* SX */
            r3 = *r14++r19            /* N */
            r17 = r16 * 2
            r17 = r17 * 2             /* float inc x */
            a0 = *r2++r17             /* initial maximum */
            r3 = r3 - 2               /* N - 2 counter */
            if (mi) goto maxinde
            r3 = r3 - 1
            r15 = 0                   /* initial index */
            r1 = r15
maxind1:    a1 = *r2++r17
            a1 = -a1 + a0
            r4 = r1                   /* save old max */
            r15 = r15 + r16
            r1 = r15                  /* store max */
            if (ale) goto newmaxind
            nop
            if (r3-- >= 0) goto maxind1
            r1 = r4                   /* store old max */
newmaxind:  a0 = -a1 + a0
            if (r3-- >= 0) goto maxind1
            nop
maxinde:    return (r18)
            nop                       /*** End MAXIND ***/


MEAN 


Prototype :   float MEAN ( int N, float SX[ ], int INCX, float FACTOR )

Arguments :   N       number of elements in array
              SX[ ]   input floating point x array
              INCX    integer increment for x array
              FACTOR  floating point multiplier (1/N)

Description : MEAN calculates the average value of an array x of N elements by using a
two pass algorithm. The first pass is used to center the data, and the second pass enables a
more accurate determination of the mean. FACTOR is passed to reduce the number of
unnecessary divisions. The array additions make this function well suited for DSP operation.
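
The effect of the second pass can be sketched in portable C. This is only an illustration under stated assumptions: the name two_pass_mean is hypothetical and not part of the library, and the 1/N multiplier is computed inside the routine rather than passed in as FACTOR. In exact arithmetic the residuals about the first-pass mean sum to zero, so whatever the second pass accumulates is rounding error from the first pass, and adding its average back corrects the estimate.

```c
/* Hypothetical illustration of a two-pass mean (not the library routine).
 * Pass 1 forms a provisional mean; pass 2 accumulates the residuals
 * about it and applies their average as a correction. */
float two_pass_mean(int n, const float *sx, int incx)
{
    float factor, sum1 = 0.0f, sum2 = 0.0f;
    int i, k;

    if (n <= 0)
        return 0.0f;
    factor = 1.0f / (float)n;           /* plays the role of FACTOR */

    for (i = k = 0; k < n; i += incx, k++)
        sum1 += sx[i];
    sum1 *= factor;                     /* first pass mean */

    for (i = k = 0; k < n; i += incx, k++)
        sum2 += sx[i] - sum1;           /* residuals about the first pass mean */

    return sum1 + sum2 * factor;        /* corrected mean */
}
```

Calling two_pass_mean(4, x, 1) on the array {1, 2, 3, 4} walks every element; passing an increment of 2 visits every second element instead.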
C code

float MEAN ( int N, float SX[], int INCX,
             float FACTOR )
{   register int i, n;
    register float sum1, sum2 = 0.0;

    sum1 = CSUM( N, SX, INCX ) * FACTOR;      /* first pass mean */
    for (i=n=0; n<N; i+=INCX, n++)
        sum2 += SX[i] - sum1;                 /* second pass mean */
    return( sum1 + sum2 * FACTOR );
}   /***** End MEAN *****/

DSP code


MEAN:     *r14++r19 = a2 = a2
          nop
          r14 = r14 - 20              /* (1+4)*4 */
          a2 = *r14++r19              /* Factor dividing */
          r17 = *r14++r19             /* INCX */
          r1 = *r14++r19              /* SX */
          r15 = *r14++r19             /* N */
          r3 = A_ZERO
          r17 = r17 * 2
          r17 = r17 * 2
          r15 = r15 - 2
          r4 = r1                     /* Copy of SX */
          /***** first pass: estimate of mean *****/
          r16 = r15                   /* load count */
          a0 = *r3                    /* load zero */
mean11:   if (r16-- >= 0) goto mean11
          a0 = a0 + *r4++r17
          r4 = r1                     /* Copy of SX */
          r16 = r15
          a1 = a0 * a2
          /***** second pass mean *****/
          a0 = *r3                    /* load zero */
mean21:   a0 = a0 + *r4++r17
          if (r16-- >= 0) goto mean21
          a0 = a0 - a1
          nop
          nop
          a0 = a1 + a0 * a2
          a2 = *r14
          return (r18)
          nop                         /*** End MEAN ***/


MEDIAN 


Prototype :   float MEDIAN ( int N, float SX[ ], int INCX )

Arguments :   N      number of elements in array
              SX[ ]  input floating point x array
              INCX   integer increment for x array

Description : MEDIAN finds the midpoint index of array x and returns the value corresponding to that
midpoint. This function calls HEAP. Only slight speed improvement is observed by making this a
low-level function call. Array x is returned sorted.

              m = \begin{cases} x_{(N+1)/2}, & N \text{ odd} \\ \tfrac{1}{2}\left( x_{N/2} + x_{N/2+1} \right), & N \text{ even} \end{cases}
C code

float MEDIAN ( int N, float SX[], int INCX )
{   register int med;

    HEAP( N, SX, INCX );                      /* sort array */
    med = N / 2;
    if ( 2 * med == N )                       /* even number of indices */
        return( 0.5 * ( SX[(med-1)*INCX] +
                        SX[med*INCX] ) );
    else                                      /* odd number of indices */
        return( SX[med*INCX] );
}   /***** End MEDIAN *****/

DSP code


MEDIAN:   r1 = r14 - 12
          *r14++r19 = r18             /* save return */
          *r14++r19 = a0 = *r1++r19   /* push new */
          *r14++r19 = a0 = *r1++r19   /* values */
          *r14++r19 = a0 = *r1++r19
          nop
          call HEAP (r18)             /* sort */
medret:   r18 = medret + 4
          r14 = r14 - 28
          r15 = *r14++r19             /* INCX */
          r1 = *r14++r19              /* SX[] */
          r2 = *r14++r19              /* N */
          r18 = *r14                  /* return addr. */
          r15 = r15 * 2
          r15 = r15 * 2
          *r14 = r15
          a1 = float( *r14 )
          r3 = r2 / 2                 /* N/2 */
          *r14 = r3
          a0 = float( *r14 )
          nop
          nop
          a0 = a0 * a1
          *r14 = a0 = int(a0)
          r4 = r3 * 2
          nop
          nop
          r3 = *r14
          nop
          r3 = r3 + r1
          a0 = *r3                    /* for odd */
          r4 - r2                     /* check even */
          if (ne) goto medanend
          r3 = r3 - r15
          a0 = a0 + *r3               /* add previous value */
          r4 = A_HALF
          nop
          a0 = a0 * *r4               /* divide by 2 */
medanend: return (r18)
          nop                         /*** End MEDIAN ***/


MINA 


Prototype :   float MINA ( int N, float SX[ ], int INCX )

Arguments :   N      number of elements in array
              SX[ ]  input floating point x array
              INCX   integer increment for x array

Description : MINA finds the minimum value in array x. This operation is well suited for DSP operation
through the use of the ifalt() instruction.
C code

float MINA ( int N, float SX[], int INCX )
{   register int i, n;
    static float mn;

    mn = SX[0];
    for (i=INCX, n=1; n<N; i+=INCX, n++)      /* go through array */
        mn = QMIN( mn, SX[i] );               /* keep track of minimum */
    return( mn );
}   /***** End MINA *****/

DSP code

MINA:     r14 = r14 - 12
          r16 = *r14++r19             /* INCX */
          r4 = *r14++r19              /* SX */
          r17 = *r14++r19             /* N */
          r16 = r16 * 2
          r16 = r16 * 2
          a1 = *r4                    /* Starting value */
          a0 = *r4++r16
          r17 = r17 - 2
          if (mi) goto minae
          r17 = r17 - 1
mina1:    a1 = -a0 + *r4
          if (r17-- >= 0) goto mina1
          a0 = ifalt(*r4++r16)
minae:    return (r18)
          nop                         /*** End MINA ***/


MININD

Prototype :   int MININD ( int N, float SX[ ], int INCX )

Arguments :   N      number of elements in array
              SX[ ]  input floating point x array
              INCX   integer increment for x array

Description : MININD finds the index of the element of array x with the minimum value. This operation
is the inverse of MAXIND.

              i : x_i = \min\{ x_j : j = 0, \ldots, N \}


C code

int MININD ( int N, float SX[], int INCX )
{   register int i, n;
    static int indx;

    indx = 0;
    for (i=INCX, n=1; n<N; i+=INCX, n++)      /* go through array */
        if ( SX[i] < SX[indx] )               /* keep track of index whose value is min */
            indx = i;
    return( indx );
}   /***** End MININD *****/

DSP code


MININD:     r14 = r14 - 12
            r16 = *r14++r19           /* INCX */
            r2 = *r14++r19            /* SX */
            r3 = *r14++r19            /* N */
            r17 = r16 * 2
            r17 = r17 * 2             /* float inc x */
            a0 = *r2++r17             /* initial minimum */
            r3 = r3 - 2               /* N - 2 counter */
            if (mi) goto mininde
            r3 = r3 - 1
            r15 = 0                   /* initial index */
            r1 = r15
minind1:    a1 = *r2++r17
            a1 = a1 - a0
            r4 = r1                   /* save old min */
            r15 = r15 + r16
            r1 = r15                  /* store min */
            if (ale) goto newminind
            nop
            if (r3-- >= 0) goto minind1
            r1 = r4                   /* store old min */
newminind:  a0 = a1 + a0
            if (r3-- >= 0) goto minind1
            nop
mininde:    return (r18)
            nop                       /*** End MININD ***/
MINMAX

Prototype :   void MINMAX ( int N, float SX[ ], int INCX, float *MIN, float *MAX, float *RANGE )

Arguments :   N       number of elements in array
              SX[ ]   input floating point x array
              INCX    integer increment for x array
              *MIN    pointer to minimum value in x array
              *MAX    pointer to maximum value in x array
              *RANGE  pointer to range of values in x array (MAX - MIN)

Description : MINMAX finds the minimum and maximum values within a floating point array. Through the
use of the ifalt() instruction, this function is well suited for DSP operation.


C code

void MINMAX ( int N, float SX[], int INCX,
              float *MIN, float *MAX, float *RANGE )
{   register int i, n;

    *MIN = SX[0];
    *MAX = SX[0];
    for (i=INCX, n=1; n<N; i+=INCX, n++)      /* go through array */
    {   *MIN = QMIN( *MIN, SX[i] );
        *MAX = QMAX( *MAX, SX[i] );           /* keep track of both min and max */
    }
    *RANGE = *MAX - *MIN;                     /* calculate range */
}   /***** End MINMAX *****/


DSP code

MINMAX:   *r14 = a2 = a2
          nop
          r14 = r14 - 24
          r1 = *r14++r19              /* RANGE */
          r2 = *r14++r19              /* MAX */
          r3 = *r14++r19              /* MIN */
          r16 = *r14++r19             /* INCX */
          r4 = *r14++r19              /* SX */
          r17 = *r14++r19             /* N */
          r16 = r16 * 2
          r16 = r16 * 2
          a1 = *r4                    /* Starting max */
          a2 = *r4++r16               /* Starting min */
          r17 = r17 - 2
          if (mi) goto minmaxe
          r17 = r17 - 1
minmax1:  a0 = a1 - *r4
          a1 = ifalt(*r4)             /* new max */
          a0 = -a2 + *r4
          if (r17-- >= 0) goto minmax1
          a2 = ifalt(*r4++r16)        /* new min */
          *r3 = a2 = a2               /* store min */
          *r2 = a1 = a1               /* store max */
          *r1 = a0 = a1 - a2          /* store range */
minmaxe:  a2 = *r14
          return (r18)
          nop                         /*** End MINMAX ***/


MOMENT 


Prototype :   void MOMENT ( int N, float SX[ ], int INCX, float *THIRD, float *FOURTH )

Arguments :   N        number of elements in array
              SX[ ]    input floating point (centered) x array
              INCX     integer increment for x array
              *THIRD   pointer to third moment (Skewness)
              *FOURTH  pointer to fourth moment (Kurtosis)

Description : MOMENT calculates the third and fourth moments of the centered x
array. Skewness and kurtosis can be calculated from these moments
through scaling and offset constants (see Press [1988]). Because of
the nops introduced through pipelining effects, this function is not as efficient for DSP
operation as the related SSQR function.
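
The scaling and offset step mentioned above can be sketched as follows. This is an illustration only: the helper name shape_from_moments and its argument list are hypothetical, and the convention (divide by N and by powers of the standard deviation, then subtract 3 from the kurtosis so that a normal distribution scores zero) is assumed to be the one intended by the reference to Press [1988]. The caller is assumed to supply the standard deviation of the same centered data.

```c
/* Hypothetical helper: converts the raw sums returned through THIRD and
 * FOURTH into skewness and excess kurtosis.  Assumed convention: divide
 * by N and by powers of the standard deviation sdev, then subtract 3
 * from the kurtosis. */
void shape_from_moments(int n, float sdev, float third, float fourth,
                        float *skew, float *kurt)
{
    float var = sdev * sdev;

    if (n <= 0 || var <= 0.0f) {        /* undefined for empty or constant data */
        *skew = *kurt = 0.0f;
        return;
    }
    *skew = third / ((float)n * var * sdev);
    *kurt = fourth / ((float)n * var * var) - 3.0f;
}
```

For the centered data {-1, 0, 1}, MOMENT would return THIRD = 0 and FOURTH = 2; with a sample standard deviation of 1 the helper gives a skewness of zero and a negative excess kurtosis, as expected for a flat three-point distribution.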
C code

void MOMENT ( int N, float SX[], int INCX,
              float *THIRD, float *FOURTH )
{   register int i, n;
    static float temp;

    *THIRD = *FOURTH = 0.0;
    for (i=n=0; n<N; i+=INCX, n++)
    {   temp = SX[i] * SX[i] * SX[i];
        *THIRD += temp;                       /* Calculate 3rd moment */
        *FOURTH += SX[i] * temp;              /* Calculate 4th moment */
    }
}   /***** End MOMENT *****/

DSP code


MOMENT:   *r14 = a2 = a2
          r14 = r14 - 20
          r1 = *r14++r19              /* FOURTH */
          r2 = *r14++r19              /* THIRD */
          r16 = *r14++r19             /* INCX */
          r3 = *r14++r19              /* SX */
          r15 = *r14++r19             /* N */
          r4 = A_ZERO
          r16 = r16 * 2
          r16 = r16 * 2
          a0 = *r4
          a1 = *r4
          r15 = r15 - 2
moment1:  a2 = *r3 * *r3
          nop
          nop
          a1 = a1 + a2 * *r3++r16
          if (r15-- >= 0) goto moment1
          a0 = a0 + a2 * a2
          *r2 = a1 = round(a1)        /* 3rd moment */
          *r1 = a0 = round(a0)        /* 4th moment */
          a2 = *r14
          return (r18)
          nop                         /*** End MOMENT ***/


PROD 


Prototype :   float PROD ( int N, float SX[ ], int INCX )

Arguments :   N      number of elements in array
              SX[ ]  floating point array
              INCX   array integer increment or step

Description : PROD returns the cumulative product of an array SX. A single nop
introduced by pipelining effects reduces the efficiency for DSP operation by 50%.


C code

float PROD ( int N, float SX[], int INCX )
{   register int i, n;
    static float prod;

    prod = 1.0;
    for (i=n=0; n<N; i+=INCX, n++)            /* array */
        prod *= SX[i];                        /* calculating cumulative product */
    return( prod );
}   /***** End PROD *****/

DSP code

PROD:     r14 = r14 - 12
          r17 = *r14++r19             /* INCX */
          r3 = *r14++r19              /* SX[] */
          r2 = *r14++r19              /* N */
          r1 = A_ONE
          a0 = *r1
          r17 = r17 * 2
          r17 = r17 * 2
          r2 = r2 - 2
prod1:    a0 = a0 * *r3++r17
          if (r2-- >= 0) goto prod1
          nop
          return (r18)
          nop                         /*** End PROD ***/
QABS

Prototype :   float QABS ( float VALUE )

Arguments :   VALUE  floating point argument

Description : QABS returns the absolute value of a single argument. This function was
introduced because of the poor efficiency provided by the AT&T C library fabs()
function.

              w = |x|


C code

float QABS ( float VALUE )
{   return( fabs( VALUE ) );                  /* return fabs of value */
}   /***** End QABS *****/

DSP code

QABS:     r14 = r14 - 4
          a0 = -*r14
          a0 = ifalt(*r14++r19)
          return (r18)
          nop                         /*** End QABS ***/


QABSA 


Prototype :   void QABSA ( int N, float SX[ ], int INCX )

Arguments :   N      number of elements in array
              SX[ ]  input floating point x array
              INCX   integer increment for x array

Description : QABSA converts all elements in array x to their absolute values. This operation is well
suited for DSP operation.

              x_i = |x_i|, \quad i = 0, \ldots, N-1


C code

void QABSA ( int N, float SX[], int INCX )
{   register int i, n;

    for (i=n=0; n<N; i+=INCX, n++)            /* array */
        SX[i] = QABS( SX[i] );                /* replace elements with their abs value */
}   /***** End QABSA *****/

DSP code

QABSA:    r14 = r14 - 12
          r16 = *r14++r19             /* INCX */
          r1 = *r14++r19              /* SX */
          r2 = *r14++r19              /* NUMBER */
          r3 = r1
          r16 = r16 * 2
          r16 = r16 * 2
          r2 = r2 - 2
qabsa1:   a0 = -*r3++r16
          if (r2-- >= 0) goto qabsa1
          *r1++r16 = a0 = ifalt(*r1)
          return (r18)
          nop                         /*** End QABSA ***/


QMAX 


Prototype :   float QMAX ( float VALUE1, float VALUE2 )

Arguments :   VALUE1  first floating point number
              VALUE2  second floating point number

Description : QMAX returns the maximum of two floating point numbers.
This function was introduced because of the poor efficiency of
the DSP C-compiler version.

              w = \begin{cases} x_1, & x_1 > x_2 \\ x_2, & \text{otherwise} \end{cases}
C code

float QMAX ( float VALUE1, float VALUE2 )
{   return( (VALUE1 > VALUE2) ? VALUE1 : VALUE2 );
}   /***** End QMAX *****/

DSP code

QMAX:     r14 = r14 - 8
          a0 = *r14++r19              /* start with VALUE2 */
          a1 = a0 - *r14              /* compare VALUE1 */
          a0 = ifalt(*r14++r19)
          return (r18)
          nop                         /*** End QMAX ***/


QMIN

Prototype :   float QMIN ( float VALUE1, float VALUE2 )

Arguments :   VALUE1  first floating point number
              VALUE2  second floating point number

Description : QMIN returns the minimum of two floating point numbers.
This function was introduced because of the poor efficiency of
the DSP C-compiler version.

C code

float QMIN ( float VALUE1, float VALUE2 )
{   return( (VALUE1 < VALUE2) ? VALUE1 : VALUE2 );
}   /***** End QMIN *****/

DSP code

QMIN:     r14 = r14 - 8
          a0 = *r14++r19              /* start with VALUE2 */
          a1 = -a0 + *r14             /* compare VALUE1 */
          a0 = ifalt(*r14++r19)
          return (r18)
          nop                         /*** End QMIN ***/


SASUM

Prototype :   float SASUM ( int N, float SX[ ], int INCX )

Arguments :   N      number of elements in array
              SX[ ]  floating point array
              INCX   array integer increment or step

Description : SASUM takes the sum of vector component magnitudes from the array SX.
Adapted from BLAS library. This operation is well suited for DSP
operation.

C code

float SASUM ( int N, float SX[], int INCX )
{   register int n, i;
    float out;

    out = 0.0;
    if (N < 0)
        return( 0.0 );

    for (i=n=0; n<N; i+=INCX, n++)
        out += fabs( SX[i] );

    return( out );
}   /***** End SASUM *****/

DSP code

SASUM:    r14 = r14 - 12
          r17 = *r14++r19             /* INCX */
          r2 = *r14++r19              /* SX */
          r3 = *r14++r19              /* N */
          r4 = A_ZERO
          r17 = r17 * 2
          r17 = r17 * 2               /* inc x */
          a0 = *r4                    /* load zero */
          r3 = r3 - 2                 /* N - 2 counter */
sasum1:   a1 = -*r2
          a1 = ifalt(*r2++r17)
          if (r3-- >= 0) goto sasum1
          a0 = a1 + a0
          return (r18)
          nop                         /*** End SASUM ***/
SAXPY

Prototype :   void SAXPY ( int N, float SX[ ], int INCX, float SY[ ], int INCY, float SA )

Arguments :   N      number of elements in array
              SX[ ]  floating point x array
              INCX   integer array increment for x array
              SY[ ]  floating point y array (output)
              INCY   integer array increment for y array
              SA     floating point scale factor, a

Description : SAXPY is the elementary vector operation y = y + ax. Adapted from BLAS library. Array
multiplication and accumulate make this well suited for DSP operation.

              y \leftarrow y + a x


C code

void SAXPY ( int N, float SX[], int INCX,
             float SY[], int INCY, float SA )
{   register int n, i, j;

    if (N < 0)
        return;

    for (i=j=n=0; n < N; i += INCX, j += INCY, n++)
        SY[j] += SA * SX[i];
}   /***** End SAXPY *****/


DSP code

SAXPY:    r14 = r14 - 24
          a1 = *r14++r19              /* SA */
          r17 = *r14++r19             /* INCY */
          r4 = *r14++r19              /* SY */
          r16 = *r14++r19             /* INCX */
          r3 = *r14++r19              /* SX */
          r1 = *r14++r19              /* N */
          r16 = r16 * 2
          r16 = r16 * 2               /* inc x */
          r17 = r17 * 2
          r17 = r17 * 2               /* inc y */
          r1 = r1 - 2                 /* N - 2 counter */
saxpy1:   if (r1-- >= 0) goto saxpy1
          *r4++r17 = a0 = *r4 + a1 * *r3++r16
          return (r18)
          nop                         /*** End SAXPY ***/


SCALCPY 


Prototype :   void SCALCPY ( int N, float SX[ ], int INCX, float SY[ ], int INCY, float SA )

Arguments :   N      number of elements in array
              SX[ ]  input floating point x array
              INCX   integer increment for x array
              SY[ ]  output floating point y array
              INCY   integer increment for y array
              SA     floating point scaling factor

Description : SCALCPY multiplies the input array x by a floating point scalar and copies the scaled
values to y; x itself is left unchanged. Array multiplication makes this well suited for DSP
operation.
C code

void SCALCPY ( int N, float SX[], int INCX,
               float SY[], int INCY, float SCALE )
{   register int i, j, n;

    for (i=j=n=0; n<N; i+=INCX, j+=INCY, n++)
        SY[j] = SX[i] * SCALE;                /* scale all elements in X */
}   /***** End SCALCPY *****/

DSP code


SCALCPY:  r14 = r14 - 24
          a1 = *r14++r19              /* SA */
          r16 = *r14++r19             /* INCY */
          r4 = *r14++r19              /* SY */
          r17 = *r14++r19             /* INCX */
          r1 = *r14++r19              /* SX */
          r3 = *r14++r19              /* N */
          r16 = r16 * 2
          r16 = r16 * 2               /* inc y */
          r17 = r17 * 2
          r17 = r17 * 2               /* inc x */
          r3 = r3 - 2                 /* N - 2 counter */
scalcpy1: if (r3-- >= 0) goto scalcpy1
          *r4++r16 = a0 = a1 * *r1++r17
          return (r18)
          nop                         /*** End SCALCPY ***/


SCOPY 


Prototype :   void SCOPY ( int N, float SX[ ], int INCX, float SY[ ], int INCY )

Arguments :   N      number of elements in array
              SX[ ]  floating point x array
              INCX   integer increment for x array
              SY[ ]  floating point y array (output)
              INCY   integer increment for y array

Description : SCOPY copies one vector onto another. Adapted from BLAS library. Automatic incrementing
of arrays makes this well suited for DSP operation.
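
One practical use of the INCX and INCY arguments is to address a column of a matrix stored in row-major order, where successive column elements lie one full row apart. The sketch below illustrates this under stated assumptions: strided_copy is a hypothetical stand-in with the same argument order as SCOPY, and extract_column is a hypothetical wrapper, neither of which is part of the library.

```c
/* Hypothetical stand-in with the same argument order as SCOPY. */
void strided_copy(int n, const float *sx, int incx, float *sy, int incy)
{
    int i, j, k;

    for (i = j = k = 0; k < n; i += incx, j += incy, k++)
        sy[j] = sx[i];
}

/* Copies column `col` of a row-major `rows` x `cols` matrix into `out`.
 * Column elements are `cols` floats apart, so the x increment is cols. */
void extract_column(int rows, int cols, const float *mat, int col, float *out)
{
    strided_copy(rows, mat + col, cols, out, 1);
}
```

The same increment convention applies to the other strided routines in this appendix, so a column sum or column dot product needs no intermediate copy.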


C code

void SCOPY ( int N, float SX[], int INCX,
             float SY[], int INCY )
{   register int n, i, j;

    if (N < 0)
        return;

    for (i=j=n=0; n<N; i+=INCX, j+=INCY, n++)
        SY[j] = SX[i];
}   /***** End SCOPY *****/


DSP code

SCOPY:    r14 = r14 - 20
          r17 = *r14++r19             /* INCY */
          r1 = *r14++r19              /* SY */
          r16 = *r14++r19             /* INCX */
          r2 = *r14++r19              /* SX */
          r3 = *r14++r19              /* N */
          r16 = r16 * 2
          r16 = r16 * 2               /* inc x */
          r17 = r17 * 2
          r17 = r17 * 2               /* inc y */
          r3 = r3 - 2                 /* N - 2 counter */
scopy1:   if (r3-- >= 0) goto scopy1
          *r1++r17 = a0 = *r2++r16
          return (r18)
          nop                         /*** End SCOPY ***/


SDOT 


Prototype :   float SDOT ( int N, float SX[ ], int INCX, float SY[ ], int INCY )

Arguments  :  N  number  of  elements  in  array 

SX[  ]  floating  point  x  array 

INCX  integer  increment  for  x  array 

SY[  ]  floating  point  y  array 

INCY  integer  increment  for  y  array 

Description : SDOT takes the dot (inner) product between two vectors. SDOT is adapted from BLAS
library. Array multiplication and accumulation makes this well suited for DSP operation.

C code

float SDOT ( int N, float SX[], int INCX,
             float SY[], int INCY )
{   register int n, i, j;
    float out;

    out = 0.0;
    if (N < 0)
        return( 0.0 );

    for (i=j=n=0; n<N; i+=INCX, j+=INCY, n++)
        out += SY[j] * SX[i];
    return( out );
}   /***** End SDOT *****/

DSP code

SDOT:     r14 = r14 - 20
          r3 = A_ZERO
          r17 = *r14++r19             /* INCY */
          r4 = *r14++r19              /* SY */
          a0 = *r3                    /* load zero */
          r16 = *r14++r19             /* INCX */
          r2 = *r14++r19              /* SX */
          r3 = *r14++r19              /* N */
          r16 = r16 * 2
          r16 = r16 * 2               /* inc x */
          r17 = r17 * 2
          r17 = r17 * 2               /* inc y */
          r3 = r3 - 2                 /* N - 2 counter */
sdot1:    if (r3-- >= 0) goto sdot1
          a0 = a0 + *r2++r16 * *r4++r17
          return (r18)
          nop                         /*** End SDOT ***/


SIGN 


Prototype :   float SIGN ( float VALOUT, float VALUE )

Arguments :   VALOUT  value to be returned with sign (y)
              VALUE   value whose sign is returned (x)

Description : SIGN transfers the sign of x to y and returns it. Adapted from
Fortran library. This was introduced to maintain compatibility
with FORTRAN routines.

              w = \begin{cases} |y|, & x \ge 0 \\ -|y|, & x < 0 \end{cases}
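
A few concrete cases make the sign-transfer rule easier to check. The function below is a hypothetical portable sketch (sign_transfer is not a library name) that follows the rule stated in the description, treating a zero second argument as non-negative.

```c
/* Hypothetical sketch of Fortran-style sign transfer: returns the
 * magnitude of `valout` carrying the sign of `value`.  A zero `value`
 * is treated as non-negative. */
float sign_transfer(float valout, float value)
{
    float mag = (valout < 0.0f) ? -valout : valout;

    return (value < 0.0f) ? -mag : mag;
}
```

With these definitions, sign_transfer(3, -2) yields -3, while sign_transfer(-3, 2) and sign_transfer(-3, 0) both yield 3.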


C code

float SIGN ( float VALOUT, float VALUE )
{   return( (VALUE < 0.0) ? -QABS(VALOUT) :
                             QABS(VALOUT) );
}   /***** End SIGN *****/


DSP code

SIGN:     r14 = r14 - 4
          a0 = -*r14                  /* abs VALOUT */
          *r14 = a0 = ifalt(*r14)
          a0 = -a0
          r14 = r14 - 4
          a1 = -*r14++r19             /* check VALUE sign */
          a0 = ifalt(*r14++r19)
          return (r18)
          nop                         /*** End SIGN ***/


SIGNA 


Prototype :   void SIGNA ( int N, float SX[ ], int INCX, float SY[ ], int INCY, float OUT[ ] )

Arguments :   N       number of elements in array
              SX[ ]   input floating point x array
              INCX    integer increment for x array
              SY[ ]   sign transfer y array
              INCY    integer increment for y array
              OUT[ ]  output array

Description : SIGNA transfers the sign of values in the x array to the y array
and then copies the result to the output array. This is useful for creating truncated waveforms or
assigning (+,-) values to an array.
C code

void SIGNA ( int N, float SX[], int INCX,
             float SY[], int INCY, float OUT[] )
{   register int i, j, k, n;

    for (i=j=k=n=0; n<N; i+=INCX, j+=INCY, k++, n++)     /* go through arrays */
        OUT[k] = SIGN( SX[i], SY[j] );                   /* transferring signs */
}   /***** End SIGNA *****/

DSP code


SIGNA:    r14 = r14 - 24
          r1 = *r14++r19              /* OUT */
          r17 = *r14++r19             /* INCY */
          r3 = *r14++r19              /* SY */
          r16 = *r14++r19             /* INCX */
          r2 = *r14++r19              /* SX */
          r15 = *r14++r19             /* N */
          r17 = r17 * 2
          r17 = r17 * 2
          r16 = r16 * 2
          r16 = r16 * 2
          r15 = r15 - 2
signa1:   a0 = -*r3
          *r14 = a0 = ifalt(*r3++r17)
          a0 = -a0
          a1 = -*r2++r16
          if (r15-- >= 0) goto signa1
          *r1++ = a0 = ifalt(*r14)
          return (r18)
          nop                         /*** End SIGNA ***/


SNRM2 


Prototype :   float SNRM2 ( int N, float SX[ ], int INCX )

Arguments :   N      number of elements in array
              SX[ ]  floating point array
              INCX   integer increment for array

Description : SNRM2 finds the Euclidean length of a vector. Adapted from
BLAS library. For short arrays, the overhead associated with the
square root dominates. Otherwise efficiency is comparable to SSQR.


C code

float SNRM2 ( int N, float SX[], int INCX )
{   register int n, i;
    float out;

    out = 0.0;
    if (N < 0)
        return( 0.0 );

    for (i=n=0; n<N; i+=INCX, n++)
        out += SX[i] * SX[i];
    return( sqrt(out) );
}   /***** End SNRM2 *****/


DSP code

SNRM2:    r14 = r14 - 12
          r17 = *r14++r19             /* INCX */
          r2 = *r14++r19              /* SX */
          r3 = *r14++r19              /* N */
          r4 = A_ZERO
          r17 = r17 * 2
          r17 = r17 * 2               /* inc x */
          a0 = *r4                    /* load zero */
          r3 = r3 - 2                 /* N - 2 counter */
snrm21:   if (r3-- >= 0) goto snrm21
          a0 = a0 + *r2++r17 * *r2
          nop
          *r14++r19 = r18
          *r14++r19 = a0 = a0
          nop
          call sqrt (r18)
sqrt1:    r18 = sqrt1+4
          r14 = r14 - 8
          r18 = *r14
          nop
          return (r18)
          nop                         /*** End SNRM2 ***/
SROT

Prototype :   void SROT ( int N, float SX[ ], int INCX, float SY[ ], int INCY, float COS, float SIN )

Arguments :   N      number of elements in array
              SX[ ]  floating point x array
              INCX   integer increment for x array
              SY[ ]  floating point y array
              INCY   integer increment for y array
              COS    cosine projection
              SIN    sine projection

Description : SROT applies a Givens plane rotation to the x and y arrays. Values for COS and SIN can be
obtained from SROTG. Adapted from BLAS library. Floating point multiplies and
accumulates on two separate arrays makes this well suited for DSP operation.

              \begin{bmatrix} x_i \\ y_i \end{bmatrix} \leftarrow \begin{bmatrix} c & s \\ -s & c \end{bmatrix} \begin{bmatrix} x_i \\ y_i \end{bmatrix}
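
A small worked case shows why such a rotation is useful. For a pair (a, b) with r = sqrt(a^2 + b^2), choosing c = a/r and s = b/r maps (a, b) to (r, 0), which is how a Givens rotation zeroes a selected element. The sketch below is illustrative only: plane_rotate is a hypothetical unit-stride stand-in for SROT, and the values c = 0.6, s = 0.8 are hard-coded for the pair (3, 4) rather than obtained from SROTG.

```c
/* Hypothetical unit-stride stand-in for SROT: rotates each pair
 * (x[k], y[k]) by the plane rotation with cosine c and sine s. */
void plane_rotate(int n, float *x, float *y, float c, float s)
{
    int k;

    for (k = 0; k < n; k++) {
        float t = c * x[k] + s * y[k];  /* new x, held until y is updated */
        y[k] = c * y[k] - s * x[k];
        x[k] = t;
    }
}
```

Applied to x = {3} and y = {4} with c = 0.6 and s = 0.8, the rotation leaves x close to 5 and y close to 0, up to single-precision rounding.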

C code

void SROT ( int N, float SX[], int INCX,
            float SY[], int INCY, float C, float S )
{   register int n, i, j;
    static float stemp;

    if (N < 0)
        return;

    for (i=j=n=0; n<N; i+=INCX, j+=INCY, n++)
    {
        stemp = C * SX[i] + S * SY[j];
        SY[j] = C * SY[j] - S * SX[i];
        SX[i] = stemp;
    }
}   /***** End SROT *****/

DSP code


SROT:     *r14++r19 = a2 = a2
          *r14++r19 = a3 = a3
          nop
          r14 = r14 - 36              /* ( 2 + 7 ) * 4 */
          a1 = *r14++r19              /* S (sin) */
          a0 = *r14++r19              /* C (cos) */
          r17 = *r14++r19             /* INCY */
          r4 = *r14++r19              /* SY */
          r16 = *r14++r19             /* INCX */
          r2 = *r14++r19              /* SX */
          r15 = *r14++r19             /* N */
          r16 = r16 * 2
          r16 = r16 * 2               /* inc x */
          r15 = r15 - 2               /* N - 2 counter */
          r17 = r17 * 2
          r17 = r17 * 2               /* inc y */
srot1:    a2 = a1 * *r4               /* S * SY */
          a3 = -a1 * *r2              /* -S * SX */
          *r4++r17 = a3 = a3 + a0 * *r4       /* -S SX + C SY */
          if (r15-- >= 0) goto srot1
          *r2++r16 = a2 = a2 + a0 * *r2       /* S SY + C SX */
          a2 = *r14++r19
          a3 = *r14++r19
          return (r18)
          r14 = r14 - 8               /*** End SROT ***/


SSCAL 


Prototype :   void SSCAL ( int N, float *SA, float SY[ ], int INCY )

Arguments :   N      number of elements in array
              *SA    pointer to floating point scale factor
              SY[ ]  floating point array (output)
              INCY   integer increment for array

Description : SSCAL multiplies a vector by a scalar. Adapted from BLAS. Array multiplication makes this
well suited for DSP operation.
C code

void SSCAL ( int N, float SX[], int INCX,
             float SA )
{   int n, i;

    if (N < 0)
        return;

    for (i=n=0; n<N; i+=INCX, n++)
        SX[i] *= SA;
}   /***** End SSCAL *****/

DSP code


SSCAL:    r14 = r14 - 16
          a1 = *r14++r19              /* SA */
          r17 = *r14++r19             /* INCX */
          r1 = *r14++r19              /* SX */
          r3 = *r14++r19              /* N */
          r17 = r17 * 2
          r17 = r17 * 2               /* inc x */
          r3 = r3 - 2                 /* N - 2 counter */
sscal1:   if (r3-- >= 0) goto sscal1
          *r1++r17 = a0 = a1 * *r1
          return (r18)
          nop                         /*** End SSCAL ***/


SSQR 


Prototype :   float SSQR ( int N, float SX[ ], int INCX )

Arguments :   N      number of elements in array
              SX[ ]  input floating point x array
              INCX   integer increment for x array

Description : SSQR calculates the sum of squares of a vector's components. If the vector is
centered, this is equivalent to the second moment. Array multiplication makes
this well suited for DSP operation.
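
The link between a centered sum of squares and the variance can be sketched as follows. The names sum_of_squares and sample_variance are hypothetical stand-ins, not library routines, and the N - 1 divisor (sample variance) is an assumption; dividing by N instead gives the population second moment.

```c
/* Hypothetical stand-in for SSQR: sum of squares with stride incx. */
float sum_of_squares(int n, const float *sx, int incx)
{
    float out = 0.0f;
    int i, k;

    for (i = k = 0; k < n; i += incx, k++)
        out += sx[i] * sx[i];
    return out;
}

/* Centers x in place about its mean, then divides the sum of squares
 * by N - 1 (assumed sample-variance convention). */
float sample_variance(int n, float *x)
{
    float mean = 0.0f;
    int k;

    if (n < 2)
        return 0.0f;
    for (k = 0; k < n; k++)
        mean += x[k];
    mean /= (float)n;
    for (k = 0; k < n; k++)
        x[k] -= mean;                   /* x is now centered */
    return sum_of_squares(n, x, 1) / (float)(n - 1);
}
```

For the array {1, 2, 3, 4, 5}, centering gives {-2, -1, 0, 1, 2}, the sum of squares is 10, and the sample variance is 2.5.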


C code

float SSQR ( int N, float SX[], int INCX )
{ register int i, n;
  static float out;

  out = 0.0;

  if (N < 0)
    return(0.0);

  for (i=n=0; n<N; i+=INCX, n++)      /* calculate sum of squares */
    out += SX[ i ] * SX[ i ];         /* for all values in SX */
  return( out );

} /***** End SSQR *****/


DSP code

SSQR:   r14 = r14 - 12
        r17 = *r14++r19                   /* INCX */
        r2 = *r14++r19                    /* SX */
        r3 = *r14++r19                    /* N */
        r4 = A_ZERO
        r17 = r17 * 2
        r17 = r17 * 2                     /* inc x */
        a0 = *r4                          /* load zero */
        r3 = r3 - 2                       /* N - 2 counter */
ssqr1:  if (r3-- >= 0) goto ssqr1
        a0 = a0 + *r2++r17 * *r2
        return (r18)
        nop                               /*** End SSQR ***/


SSWAP

Prototype   :  void SSWAP (int N, float SX[ ], int INCX, float SY[ ], int INCY)

Arguments   :  N      number of elements in array
               SX[ ]  floating point x array
               INCX   integer increment for x array
               SY[ ]  floating point y array (output)
               INCY   integer increment for y array

Description :  SSWAP interchanges two vectors. Adapted from BLAS library. Automatic incrementing of
               arrays makes this well suited for DSP operation.

               $x \leftrightarrow y$
C code

void SSWAP (int N, float SX[], int INCX,
            float SY[], int INCY)
{ register int n, i, j;
  static float stemp;

  if (N < 0)
    return;

  for (i=j=n=0; n<N; i+=INCX, j+=INCY, n++)
  {
    stemp = SX[i];
    SX[i] = SY[j];
    SY[j] = stemp;
  }

} /***** End SSWAP *****/


DSP code

SSWAP:  r14 = r14 - 20
        r17 = *r14++r19                   /* INCY */
        r1 = *r14++r19                    /* SY */
        r16 = *r14++r19                   /* INCX */
        r2 = *r14++r19                    /* SX */
        r3 = *r14++r19                    /* N */
        r16 = r16 * 2
        r16 = r16 * 2                     /* inc x */
        r17 = r17 * 2
        r17 = r17 * 2                     /* inc y */
        r3 = r3 - 2                       /* N - 2 counter */
sswap1: a0 = *r2
        *r2++r16 = a1 = *r1
        if (r3-- >= 0) goto sswap1
        *r1++r17 = a0 = a0
        return (r18)
        nop                               /*** End SSWAP ***/


SUBVEC

Prototype   :  void SUBVEC (int N, float SX[ ], float SY[ ], float SZ[ ])

Arguments   :  N      number of elements in array
               SX[ ]  floating point x array input
               SY[ ]  floating point y array input
               SZ[ ]  floating point output array

Description :  SUBVEC subtracts two floating point vectors. The modified values are returned in a separate
               array. Array subtraction and automatic incrementing make this well suited for DSP operation.

               $z = x - y$


C code

void SUBVEC ( int N, float SX[],
              float SY[], float SZ[] )
{
  register int n;

  for (n=0; n<N; n++)
    SZ[n] = SX[n] - SY[n];

} /***** End SUBVEC *****/


DSP code

SUBVEC:  r14 = r14 - 16
         r1 = *r14++r19                   /* output array SZ[] */
         r2 = *r14++r19                   /* input array SY[] */
         r3 = *r14++r19                   /* input array SX[] */
         r4 = *r14++r19                   /* # elements N */
         nop
         r4 = r4 - 2
subvec1: if (r4-- >= 0) goto subvec1
         *r1++ = a0 = *r3++ - *r2++
         return (r18)
         nop                              /*** End SUBVEC ***/


SUMUNTIL

Prototype   :  int SUMUNTIL (int N, float SX[ ], int INCX, float SA)

Arguments   :  N      number of elements in array
               SX[ ]  floating point x array
               INCX   integer increment for x array
               SA     floating point ending value

Description :  SUMUNTIL performs a cumulative sum on an array of numbers, returning
               the index where the sum exceeds the value set by SA. This is useful for calculating quartiles.
               The conditional statement decreases the performance of the DSP version.
C code

int SUMUNTIL ( int N, float SX[], int INCX,
               float SA )
{ register int j, n;
  static float sum;

  sum = 0.0;
  n = N * INCX;
  j = 0;
  do
  { sum += SX[j];
    j += INCX;
  }
  while ((j<n) && (sum<=SA));
  return( j-INCX );

} /***** End SUMUNTIL *****/


DSP code


SUMUNTIL: *r14++r19 = a2
          nop
          r14 = r14 - 20
          a2 = *r14++r19                  /* SA boundary */
          r17 = *r14++r19                 /* INCX */
          r3 = *r14++r19                  /* SX[] */
          r2 = *r14++r19                  /* N */
          r17 = r17 * 2
          r17 = r17 * 2                   /* adjust for float */
          a0 = a2 - a2
          r2 = r2 - 2                     /* loop counter */
          r1 = r1 - r1
          r1 = r1 - 1                     /* set index counter */
sumto1:   a1 = a0 - a2                    /* exceed SA check */
          nop
          nop
          nop
          if (age) goto sumtoend          /* check for end of summation */
          a0 = a0 + *r3++r17
          if (r2-- >= 0) goto sumto1
          r1 = r1 + 1                     /* increment index counter */
sumtoend: a2 = *r14
          return (r18)
          nop                             /*** End SUMUNTIL ***/
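
As a hedged illustration of the quartile use mentioned in the description, the following C sketch applies a cumulative sum to histogram bin counts. The helper names sum_until and quantile_bin and the idea of working on bin counts are assumptions made for this example; the loop follows the do/while form of the C listing above, and N is assumed to be at least 1.

```c
/* Hypothetical helper mirroring the SUMUNTIL listing: returns the index at
   which the running sum of SX first exceeds SA, or the last index visited
   if the sum never exceeds SA. N must be at least 1. */
static int sum_until(int N, const float SX[], int INCX, float SA)
{
    int   j = 0;
    int   n = N * INCX;
    float sum = 0.0f;

    do {
        sum += SX[j];
        j += INCX;
    } while ((j < n) && (sum <= SA));

    return j - INCX;
}

/* First bin at which the cumulative count exceeds q times the total count
   (0 < q < 1) for a histogram with K bins. */
static int quantile_bin(int K, const float counts[], float q)
{
    float total = 0.0f;
    int   k;

    for (k = 0; k < K; k++)
        total += counts[k];
    return sum_until(K, counts, 1, q * total);
}
```

Calling quantile_bin with q set to 0.25, 0.5 and 0.75 locates the bins of the three quartiles in turn.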


TRANSPOSE

Prototype   :  void TRANSPOSE (int M, int N, float MATA[ ], float TRAN[ ])

Arguments   :  M        number of rows in matrix
               N        number of columns in matrix
               MATA[ ]  input floating point matrix
               TRAN[ ]  output floating point matrix

Description :  TRANSPOSE returns the transpose of an M x N matrix. The matrix
               is stored as a one dimensional array in row-major order. Because of
               automatic incrementing of arrays, this operation is well suited for DSP operation.

               $B = A^{T}$, $A \in R^{M \times N}$, and $B \in R^{N \times M}$

C code

void TRANSP ( int N, int P, float MATA[],
              float TRAN[] )
{ register int n, p;

  for (p=0; p<P; p++)
    for (n=0; n<N; n++)
      TRAN[p*N + n] = MATA[n*P + p];

} /***** End TRANSP *****/


DSP code

TRANSP:  r14 = r14 - 16
         r3 = *r14++r19                   /* TRAN matrix t_data */
         r4 = *r14++r19                   /* MATRIX matrix data */
         r2 = *r14++r19                   /* P n_p size */
         r17 = *r14++r19                  /* N n_dat size */
         r16 = r2 - 2                     /* copy of P count */
         r15 = r17 * 2
         r15 = r15 * 2
         r2 = r16                         /* n_p size for loop */
         r17 = r17 - 2                    /* n_dat size for loop */
         r1 = r3                          /* copy of TRANS */
transp1: if (r2-- >= 0) goto transp1
         *r3++r15 = a0 = *r4++
         r1 = r1 + 4
         r2 = r16
         if (r17-- >= 0) goto transp1
         r3 = r1
         return (r18)
         nop                              /*** End TRANSP ***/
UPDATPROD

Prototype   :  void UPDATPROD (int N, float SX[ ], int INCX, float SY[ ], int INCY, float SZ[ ], int INCZ)

Arguments   :  N      number of elements in array
               SX[ ]  floating point x array
               INCX   integer increment for x array
               SY[ ]  floating point y array - output
               INCY   integer increment for y array
               SZ[ ]  floating point z array
               INCZ   integer increment for z array

Description :  UPDATPROD accumulates the product of elements in the x and z arrays in the y array. This
               is useful for calculating covariance. Array multiplication and accumulation make this well
               suited for DSP operation. An additional instruction is introduced because only 3 memory
               references are allowed per instruction.

C code

void UPDATPROD ( int N, float SX[], int INCX,
                 float SY[], int INCY, float SZ[], int INCZ )
{ register int n, i, j, k;

  if (N < 0)
    return;

  for (i=j=k=n=0; n<N;
       i+=INCX, j+=INCY, k+=INCZ, n++)
    SY[j] += SX[i] * SZ[k];

} /***** End UPDATPROD *****/


DSP code

UPDATPROD: *r14++r19 = r5
           nop
           r14 = r14 - 32                 /* (1+7)*4 */
           r15 = *r14++r19                /* INCZ */
           r1 = *r14++r19                 /* SZ */
           r17 = *r14++r19                /* INCY */
           r4 = *r14++r19                 /* SY */
           r16 = *r14++r19                /* INCX */
           r2 = *r14++r19                 /* SX */
           r3 = *r14++r19                 /* N */
           r5 = r4
           r16 = r16 * 2
           r16 = r16 * 2                  /* inc x */
           r17 = r17 * 2
           r17 = r17 * 2                  /* inc y */
           r15 = r15 * 2
           r15 = r15 * 2                  /* inc z */
           r3 = r3 - 2                    /* N - 2 counter */
updatp1:   a1 = *r5++r17
           if (r3-- >= 0) goto updatp1
           *r4++r17 = a0 = a1 + *r2++r16 * *r1++r15
           nop
           r5 = *r14++r19
           return (r18)
           r14 = r14 - 4                  /***** End UPDATPROD *****/


UPDATSQR

Prototype   :  void UPDATSQR (int N, float SX[ ], int INCX, float SY[ ], int INCY)

Arguments   :  N      number of elements in array
               SX[ ]  floating point x array
               INCX   integer increment for x array
               SY[ ]  floating point y array - output
               INCY   integer increment for y array

Description :  UPDATSQR accumulates the square of the elements in the x array in the y array. This is
               useful for calculating variance. Array multiplication and accumulation make this well suited
               for DSP operation. An additional instruction is introduced because only 3 memory references
               are allowed per instruction.

C code

void UPDATSQR (int N, float SX[], int INCX,
               float SY[], int INCY)
{ register int n, i, j;

  if (N < 0)
    return;

  for (i=j=n=0; n<N; i+=INCX, j+=INCY, n++)
    SY[j] += SX[i] * SX[i];

} /***** End UPDATSQR *****/


DSP code

UPDATSQR: r14 = r14 - 20
          r17 = *r14++r19                 /* INCY */
          r4 = *r14++r19                  /* SY */
          r16 = *r14++r19                 /* INCX */
          r2 = *r14++r19                  /* SX */
          r3 = *r14++r19                  /* N */
          r1 = r4
          r16 = r16 * 2
          r16 = r16 * 2                   /* inc x */
          r17 = r17 * 2
          r17 = r17 * 2                   /* inc y */
          r3 = r3 - 2                     /* N - 2 counter */
updatsq1: a1 = *r1++r17
          if (r3-- >= 0) goto updatsq1
          *r4++r17 = a0 = a1 + *r2++r16 * *r2
          return (r18)
          nop                             /*** End UPDATSQR ***/


UPDATSUM

Prototype   :  void UPDATSUM (int N, float SX[ ], int INCX, float SY[ ], int INCY)

Arguments   :  N      number of elements in array
               SX[ ]  floating point x array
               INCX   integer increment for x array
               SY[ ]  floating point y array - output
               INCY   integer increment for y array

Description :  UPDATSUM accumulates the elements in the x array in the y array. This is useful for
               calculating a running mean. Array accumulation makes this well suited for DSP operation.

               $y_i = y_i + x_i, \quad i = 1, \ldots, N$


C code

void UPDATSUM (int N, float SX[], int INCX,
               float SY[], int INCY)
{ register int n, i, j;

  if (N < 0)
    return;

  for (i=j=n=0; n<N; i+=INCX, j+=INCY, n++)
    SY[j] += SX[i];

} /***** End UPDATSUM *****/


DSP code

UPDATSUM: r14 = r14 - 20
          r17 = *r14++r19                 /* INCY */
          r4 = *r14++r19                  /* SY */
          r16 = *r14++r19                 /* INCX */
          r2 = *r14++r19                  /* SX */
          r3 = *r14++r19                  /* N */
          r16 = r16 * 2
          r16 = r16 * 2                   /* inc x */
          r17 = r17 * 2
          r17 = r17 * 2                   /* inc y */
          r3 = r3 - 2                     /* N - 2 counter */
updats1:  if (r3-- >= 0) goto updats1
          *r4++r17 = a0 = *r4 + *r2++r16
          return (r18)
          nop                             /***** End UPDATSUM *****/


VECMAT

Prototype   :  void VECMAT (int N, int P, float VECA[ ], float MATB[ ], float VECC[ ])

Arguments   :  N        number of rows in matrix
               P        number of columns in matrix
               VECA[ ]  input floating point vector
               MATB[ ]  floating point matrix
               VECC[ ]  output floating point vector

Description :  VECMAT multiplies an N x P matrix by a vector of length N. The matrix is stored as a one
               dimensional array in row-major order.

               $C = B \times A$, $A \in R^{N}$, $B \in R^{N \times P}$, and $C \in R^{P}$


C code

void VECMAT ( int N, int P, float VECA[],
              float MATB[], float VECC[] )
{ register int n, p;
  static float sum;

  for (p=0; p<P; p++)
  { sum = 0.0;
    for (n=0; n<N; n++)
      sum += VECA[n] * MATB[n*P + p];
    VECC[p] = sum;
  }

} /***** End VECMAT *****/


DSP code


VECMAT:  *r14++r19 = r5
         *r14++r19 = r6
         *r14++r19 = r7
         nop
         r14 = r14 - 32                   /* (3+5)*4 */
         r6 = *r14++r19                   /* address of C[0] */
         r5 = *r14++r19                   /* address of B[0,0] */
         r4 = *r14++r19                   /* address of A[0] */
         r17 = *r14++r19                  /* P */
         r2 = *r14++r19                   /* N */
         r1 = r5                          /* points to B11 */
         r7 = A_ZERO
         a1 = *r7
         r15 = r17 * 2
         r15 = r15 * 2                    /* r15 = 4P */
         r17 = r17 - 2                    /* loop counter for P */
vecmat1: a0 = a1
         r7 = r4                          /* points to Ak, initially k=1 */
         r3 = r1                          /* points to Bkj, initially k=j=1 */
                                          /* computes sum of Ak*Bkj */
         r16 = r2 - 3                     /* loop counter for k or N */
vecmat2: if (r16-- >= 0) goto vecmat2
         a0 = a0 + *r3++r15 * *r7++       /* k is the variable */
         *r6++ = a0 = a0 + *r3++ * *r7++  /* stores Cj */
         if (r17-- >= 0) pcgoto vecmat1
         r1 = r1 + 4                      /* increments j and repeat */
         r5 = *r14++r19
         r6 = *r14++r19
         r7 = *r14++r19
         return (r18)
         r14 = r14 - 12                   /*** End VECMAT ***/


WDOT

Prototype   :  float WDOT (int N, float SX[ ], int INCX, float SY[ ], int INCY, float SW[ ], int INCW)

Arguments   :  N      number of elements in array
               SX[ ]  floating point x array
               INCX   integer increment for x array
               SY[ ]  floating point y array
               INCY   integer increment for y array
               SW[ ]  floating point w array (weight)
               INCW   integer increment for w array

Description :  WDOT takes the weighted dot (inner) product between two vectors. Adapted from BLAS
               library. Array multiplication and accumulation make this well suited for DSP operation.
C code

float WDOT (int N, float SX[], int INCX,
            float SY[], int INCY, float SW[], int INCW )
{ register int n, i, j, k;
  float out;

  out = 0.0;

  if (N < 0)
    return( 0.0 );

  for (i=j=k=n=0; n<N; i+=INCX, j+=INCY, k+=INCW, n++)
    out += SW[k] * SY[j] * SX[i];
  return( out );

} /***** End WDOT *****/


DSP code

WDOT:   r14 = r14 - 28
        r3 = A_ZERO
        r15 = *r14++r19                   /* INCW */
        r1 = *r14++r19                    /* SW */
        r17 = *r14++r19                   /* INCY */
        r4 = *r14++r19                    /* SY */
        a0 = *r3                          /* load zero */
        r16 = *r14++r19                   /* INCX */
        r2 = *r14++r19                    /* SX */
        r3 = *r14++r19                    /* N */
        r15 = r15 * 2
        r15 = r15 * 2                     /* inc w */
        r16 = r16 * 2
        r16 = r16 * 2                     /* inc x */
        r17 = r17 * 2
        r17 = r17 * 2                     /* inc y */
        r3 = r3 - 2                       /* N - 2 counter */
wdot1:  a1 = *r2++r16 * *r4++r17
        nop
        if (r3-- >= 0) goto wdot1
        a0 = a0 + a1 * *r1++r15
        return (r18)
        nop                               /*** End WDOT ***/
B Appendix - Random Number Generation


B.1 Random Number Generators

Computation-intensive statistical methods, such as bootstrapping, shuffling, and Monte Carlo
simulation, require that a suitable random number generator be provided by the computer system.
However, many of the random number generators provided by systems are inadequate for such
implementation. These statistical methods require a highly random generator with a long cycle
length, because of the large data sets they operate on and the large number of iterations they
perform.

Many  of  the  random  number  generators  used  in  systems  are  prime-modulus  multiplicative 
congruential  generators  of  the  form: 

F(z)  =  a  *  z  mod  m 

where a is an integer multiplier less than m, and m is a large prime integer (the prime modulus). When
selecting a good random number generator, three important factors must be considered [Park 1988]:

1.) The function must generate a full period of length m-1.

2.) The sequence of numbers should be uncorrelated and uniformly random.

3.) The function must be implementable within the machine's number format.
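
The multiplicative form F(z) = a * z mod m can be sketched in C as follows. This is a minimal illustration and not the generator used in this report: the constants a = 16807 and m = 2^31 - 1 are the "minimal standard" values recommended in [Park 1988], the names lehmer_next and lehmer_uniform are hypothetical, and a 64-bit intermediate is assumed so that the product a * z cannot overflow (the third factor in the list above).

```c
#include <stdint.h>

/* Hypothetical names; the constants are the "minimal standard" of [Park 1988]. */
#define LEHMER_A 16807u
#define LEHMER_M 2147483647u   /* 2^31 - 1, a prime modulus */

/* One step of F(z) = a * z mod m; z must lie in 1 .. m-1. */
static uint32_t lehmer_next(uint32_t z)
{
    /* The 64-bit product keeps a * z exact before the reduction. */
    return (uint32_t)(((uint64_t)LEHMER_A * z) % LEHMER_M);
}

/* Map a seed to a uniform value strictly between 0.0 and 1.0. */
static double lehmer_uniform(uint32_t z)
{
    return (double)z / (double)LEHMER_M;
}
```

On a machine without a wide integer type, such as a DSP with 16-bit integers, the same recurrence would need either a smaller modulus or a decomposition of the product, which is the difficulty the following sections address.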

B.2 The AT&T ran function

The random number generator provided in AT&T's software library is a slight variation of
a prime-modulus multiplicative congruential generator which generates uniformly distributed real
values between 0.0 and 1.0. The function uses an initial seed which is continually updated and used
to determine the next random value. The function has the following form [AT&T 1988]:

$$seed_i = (25173.0 \cdot seed_{i-1} + 13849.0) \bmod 65536$$

$$ran = \frac{seed_i}{65536}$$

This routine has limited usefulness in statistical applications, however, because it has a cycle
length of only 65536. For example, when bootstrapping is performed to determine the variability of
regression coefficients on 16 separate data sets, only 4096 distinct bootstraps are possible before
repetition occurs.
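
The short cycle can be checked directly. The sketch below assumes the recurrence is the mixed congruential form seed = (25173 * seed + 13849) mod 65536, carried out here in integer arithmetic rather than in the floating-point arithmetic of the library routine; the function names are hypothetical.

```c
#include <stdint.h>

/* One step of the assumed recurrence: seed = (25173*seed + 13849) mod 65536.
   The intermediate value stays below 2^32 for any seed below 65536. */
static uint32_t att_next(uint32_t seed)
{
    return (25173u * seed + 13849u) % 65536u;
}

/* Count the steps taken until the starting seed recurs (the cycle length). */
static uint32_t att_cycle_length(uint32_t start)
{
    uint32_t first = start % 65536u;
    uint32_t seed = att_next(first);
    uint32_t steps = 1;

    while (seed != first) {
        seed = att_next(seed);
        steps++;
    }
    return steps;
}
```

Every starting seed lies on the same cycle of 65536 values, so changing the seed only changes the starting point within that cycle.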
B.3 Improved Random Number Generator


In order for a random number generator to be useful in statistical computations it must have
a very long cycle length, and we must be able to efficiently implement the function on the DSP. One
possible solution is to choose a larger value for the prime modulus m, and then select a multiplier
which would create a full cycle. These numbers would then be represented as integer-valued
floating-point numbers, since the DSP can represent integers only as 16-bit values. However, this solution
is also limited because the DSP uses only 23 bits to represent the fractional part of a floating-point
value. Thus problems can occur when the product of the multiplier and seed is so large that it must
be rounded to fit into the fractional field of the floating-point number.
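
The rounding problem can be observed on any host that uses IEEE single precision, which also stores 23 fraction bits (plus a hidden leading bit). The following C sketch is an illustration on the host, not DSP code; the helper name fits_in_float is hypothetical, and the DSP format is assumed to behave analogously.

```c
#include <stdint.h>

/* Returns 1 if the integer v (assumed below 2^31) survives a round trip
   through a 32-bit float, 0 if it had to be rounded. */
static int fits_in_float(uint32_t v)
{
    volatile float f = (float)v;   /* volatile forces storage in single precision */
    return (uint32_t)f == v;
}
```

Integers up to 2^24 = 16777216 are represented exactly, while 16777217 is not. A congruential step is therefore exact in floating point only when the largest product of multiplier and seed stays within this range.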

A better solution is to use a combination congruential generator, which has the form [Lewis
1989, Wichmann 1985]:

$$seed1_i = a1 \cdot seed1_{i-1} \bmod m1$$
$$seed2_i = a2 \cdot seed2_{i-1} \bmod m2$$
$$seed3_i = a3 \cdot seed3_{i-1} \bmod m3$$

$$ran = \left( \frac{seed1_i}{m1} + \frac{seed2_i}{m2} + \frac{seed3_i}{m3} \right) \bmod 1$$

By choosing a1, a2, a3 to be 179, 183, 182 and m1, m2, m3 to be 32771, 32779, 32783, respectively,
we can create a random number generator with a cycle of 8.8 trillion [Elkins 1989].
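
A minimal C sketch of such a three-component generator is given below. The type and function names are hypothetical, and the moduli are taken to be 32771, 32779 and 32783, an assumption that is consistent with the quoted cycle since (m1-1)(m2-1)(m3-1)/4 is about 8.8 x 10^12. Integer arithmetic is used for clarity; every product of multiplier and seed stays below 2^23, so the same steps could in principle be carried out exactly in single-precision floating point.

```c
#include <stdint.h>

/* Hypothetical state type holding the three seeds (each in 1 .. m-1). */
typedef struct { uint32_t s1, s2, s3; } comb_state;

static const uint32_t A1 = 179, A2 = 183, A3 = 182;        /* multipliers */
static const uint32_t M1 = 32771, M2 = 32779, M3 = 32783;  /* assumed prime moduli */

/* Advance all three seeds and return a value in [0.0, 1.0). */
static double comb_ran(comb_state *st)
{
    double u;

    st->s1 = (A1 * st->s1) % M1;
    st->s2 = (A2 * st->s2) % M2;
    st->s3 = (A3 * st->s3) % M3;

    /* Sum of the three fractions, reduced modulo 1. */
    u = (double)st->s1 / M1 + (double)st->s2 / M2 + (double)st->s3 / M3;
    return u - (double)(int)u;
}
```

Each call performs three multiply-and-reduce steps instead of one, so the sketch does roughly three times the arithmetic of a single congruential generator.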

The  major  disadvantage  with  such  a  generator,  though,  is  that  it  takes  approximately  three 
times  longer  to  generate  the  random  number.  While  tests  show  that  this  factor  of  three  is  true  for 
the  C  coded  version,  the  DSP  version  only  takes  twice  as  long. 

Performing  the  same  bootstrapping  test  that  was  used  to  evaluate  the  AT&T  ran  function  on 
the  new  random  generator,  we  do  notice  improvement  in  the  cycle  time.  The  obvious  major 
advantage  with  this  new  random  number  generator  is  the  fact  that  repetition  of  identical  random  data 
sets  is  unlikely  to  occur. 
C Appendix - Floating Point Format


C.1 DSP Floating-point Format

The data type format for representing single precision floating-point numbers on the AT&T
DSP32 differs from the IEEE standard used in most computer systems. As a result of this difference
in number representation, all floating-point numbers that are downloaded or uploaded between the
DSP and its host processor must go through a conversion process.

The number of bits holding the mantissa and the exponent is the same in each format;
however, their order and representation differ. The formats for both DSP and IEEE are given below
for  comparison  [AT&T  19^]: 

DSP:   sfffffff ffffffff ffffffff eeeeeeee

IEEE:  seeeeeee efffffff ffffffff ffffffff

where: s = sign bit
       f = fractional part of mantissa
       e = exponent

The actual floating-point quantity which is given in each representation can be calculated in base 10
by the following formulas:

DSP:

$$N = [(-2)^{s} + 0.f] \times 2^{e-128}$$

IEEE:

$$N = (-1)^{s} \times 1.f \times 2^{e-127}$$

From the above formulas we can see that the mantissa for the DSP floating-point number is expressed
as a two's complement quantity, as compared to IEEE's sign/magnitude quantity.
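
As a hedged illustration of the DSP formula, the following C sketch decodes a 32-bit word laid out as one sign bit, 23 fraction bits and 8 exponent bits. The exponent offset of 128, the treatment of e = 0 as zero, and the function names are assumptions of this sketch.

```c
#include <stdint.h>

/* 2 raised to the integer power k, computed by repeated scaling. */
static double pow2_int(int k)
{
    double r = 1.0;
    while (k > 0) { r *= 2.0; k--; }
    while (k < 0) { r *= 0.5; k++; }
    return r;
}

/* Decode a word laid out as s | 23-bit f | 8-bit e (assumed offset 128). */
static double dsp32_value(uint32_t w)
{
    uint32_t s = w >> 31;
    uint32_t f = (w >> 8) & 0x7FFFFFu;
    int      e = (int)(w & 0xFFu);
    double   mant;

    if (e == 0)
        return 0.0;                      /* assumption: e == 0 encodes zero */

    /* (-2)^s + 0.f : 1.f for positive values, -2 + 0.f for negative ones */
    mant = (s ? -2.0 : 1.0) + (double)f / 8388608.0;
    return mant * pow2_int(e - 128);
}
```

For example, under these assumptions the word 0x00000080 decodes to 1.0 and the word 0x8000007F decodes to -1.0, which shows that a negative value is not simply the positive pattern with the sign bit set.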

C.2 Conversion Process

The process for converting DSP format to IEEE format can be derived from the equations
given above, and involves the following steps:

1.) Save the sign bit, and take the two's complement of the mantissa if the sign bit is set,
    indicating a negative quantity.

2.) Subtract one from the exponent.

3.) Rearrange the bits in the proper sequence according to the IEEE format, placing the sign in the
    leftmost bit position, the exponent in the next 8 bit positions, and the fractional part in the
    last 23 bits.

The process for converting from IEEE to DSP format is very similar. The only
difference is that one is added to the exponent instead of subtracted.
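
The three steps above can be sketched in C on the host as follows. This is an illustration under stated assumptions and not the library's conversion routine: the function name is hypothetical, an exponent field of 0 is assumed to encode zero, the extra branch for a negative value with a zero fraction is an addition of this sketch (in that case the magnitude is exactly 2, which carries into the exponent), and results outside the IEEE normal range are not handled.

```c
#include <stdint.h>

/* Convert a DSP-format word (s | 23-bit f | 8-bit e) to IEEE single-precision bits. */
static uint32_t dsp32_to_ieee_bits(uint32_t w)
{
    uint32_t s = w >> 31;
    uint32_t f = (w >> 8) & 0x7FFFFFu;
    uint32_t e = w & 0xFFu;

    if (e == 0)
        return 0;                        /* assumption: e == 0 encodes zero */

    if (s) {                             /* step 1: negative quantity */
        if (f == 0)
            e += 1;                      /* -2.0 * 2^k equals -1.0 * 2^(k+1) */
        else
            f = 0x800000u - f;           /* two's complement of the fraction */
    }
    e -= 1;                              /* step 2: subtract one from the exponent */

    /* step 3: sign, exponent, fraction in IEEE order */
    return (s << 31) | ((e & 0xFFu) << 23) | f;
}
```

Keeping such a routine on the host makes it possible to inspect intermediate DSP results in the host's own format.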

Originally  the  responsibility  for  performing  all  conversions  was  placed  on  the  DSP  because 
it  contained  the  necessary  conversion  routines  in  RAM  or  ROM.  However,  by  providing  the  host 
processor  with  the  conversion  routines,  we  then  have  the  ability  to  observe  and  control  intermediate 
results.  Thus  the  host  can  be  used  as  a  parallel  monitor  to  help  debug  DSP  programs,  in  addition 
to  being  a  top  level  interface  to  the  DSP. 

The  ideal  situation  would  be  to  have  both  the  DSP  and  its  host  use  the  same  floating-point 
representation,  in  order  to  reduce  overhead  in  the  DSP  execution.  The  table  below  gives  the 
overhead  required  for  the  floating-point  conversion  routines  in  the  DSP32  running  at  16  MHz 
[AT&T  1988]. 


           NUMBER OF        EXECUTION
           INSTRUCTIONS     TIME (µsecs)

dsp32      12N + 11         3N + 2.75

ieee32     16N + 16         4N + 4

Converting large quantities of data, in addition to downloading and uploading them, can take a
considerable amount of time. Although the AT&T DSP32C provides a one-instruction conversion
process, the routine must still iterate through all data points for conversion. By using identical
floating-point formats between host and DSP, the overhead will be reduced to only the
downloading and uploading processes. One such DSP which provides identical formats is the
Motorola 96002.
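
The conversion overhead for a given data set size can be estimated with a few lines of C. The sketch assumes instruction counts of 12N + 11 (dsp32) and 16N + 16 (ieee32) and an instruction cycle of 0.25 µsec at 16 MHz, which together reproduce the tabulated times of 3N + 2.75 and 4N + 4 µsecs; the names are hypothetical.

```c
/* Assumed DSP32 instruction cycle at 16 MHz, in microseconds. */
#define DSP32_CYCLE_US 0.25

/* Time in microseconds for a routine costing per_item * n + fixed instructions. */
static double conv_time_us(long per_item, long fixed, long n)
{
    return (double)(per_item * n + fixed) * DSP32_CYCLE_US;
}
```

For N = 1000 values this gives about 3 milliseconds for the dsp32 routine and about 4 milliseconds for the ieee32 routine, before any transfer time is counted.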
D Appendix - DSP Device and Board Description


D.1 DSP devices

AT&T DSP32 and DSP32C. Since the AT&T DSP32 was used for benchmarking, we will
examine it in detail.

AT&T developed the first floating point processor, the DSP32-250 (-250 refers to the instruction
cycle in nanoseconds), capable of a peak performance of 8 MFLOPS, in 1984. The DSP32 is a
general-purpose digital signal processor with 32-bit floating point arithmetic. The floating point adder
has 8 additional bits to provide higher accuracy when summing a number of terms. The DSP32-160,
a faster version of the same basic design, was released in 1986. This version has a peak performance
of 12.5 MFLOPS.
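
The quoted peak ratings follow from the instruction cycle times if one multiply and one add are assumed to complete in each instruction cycle. That assumption, and the function name, belong to this sketch rather than to the device documentation.

```c
/* Peak MFLOPS for an instruction cycle given in nanoseconds, assuming one
   multiply and one add (two floating point operations) per cycle. */
static double peak_mflops(double cycle_ns)
{
    return 2.0 * 1000.0 / cycle_ns;
}
```

Under this assumption a 250 ns cycle gives 8 MFLOPS and a 160 ns cycle gives 12.5 MFLOPS, matching the two ratings quoted above.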

The DSP32C is the latest release and is available in both 100 ns and 80 ns versions. It is not
only faster than the original DSP32 but also supports additional instructions. The DSP32C also
supports a faster bus transfer rate, which will speed up data transfer to and from the host. There is a
substantial price difference between the two processors due to the speed and complexity.

The DSP32C's more powerful instruction set includes a no-overhead do loop. This feature
could provide a vector loop improvement factor of 2x. The addition of fast conditional check
instructions, ifff and ifalt, will speed up some algorithms, particularly those involving operations such
as MIN and MAX. The extended addressing of the DSP32C provides the capability of a greatly
expanded memory (24-bit address space) as compared with the 16-bit address available in the DSP32.

Both  versions  of  the  DSP32  use  a  modified  version  of  the  von  Neumann  architecture.  The 
majority  of  the  operations  are  register  transfer  oriented.  As  such,  the  lower  level  operations  are 
similar to the standard microprocessors. On the other hand, the more complex operations, such as
MAC,  use  a  notation  that  draws  from  C.  This  particular  feature  makes  the  assembly  level 
programming  of  the  DSP32  much  simpler  than  programming  the  conventional  microprocessor. 

CPUs.  The  DSP32  has  two  CPUs.  The  floating  point  CPU  is  called  a  Data  Arithmetic  Unit 
(DAU)  and  the  integer  CPU  is  called  a  Control  Arithmetic  Unit  (CAU). 

Accumulators/registers. The DSP32 has four floating point accumulators. A total of 21 integer
registers are available. These registers are normally used for memory addressing and for integer
arithmetic. Some of these support special I/O functions.

Status  indication.  Status  indication  flags  are  implemented  for  both  processors.  These  flags 
are  affected  by  the  results  of  certain  instructions.  The  user  may  test  these  flags  using  conditional 
instructions. 

Memory. Two different memory areas are used: on-chip and off-chip memory.
The on-chip memory provides the fastest access, but is usually quite limited. The off-chip memory
is used to provide for bulk data storage.

External bus. To provide fast access, DSPs usually separate the address and data buses.

Parallel  interface.  The  parallel  interface  provides  the  primary  means  for  data  transfer.  Usually 
the  parallel  interface  is  tied  to  the  host  bus  with  a  suitable  buffering. 

Serial  interface.  In  addition  to  the  parallel  interface,  a  high-speed  serial  interface  is  provided. 
Although  this  interface  was  not  used  during  the  initial  investigation,  it  could  provide  additional  data 
transfer  capability  with  multiple  DSP  boards. 

Motorola  96002  (96K)  The  Motorola  96002  uses  a  Harvard  architecture  and  supports  two 
separate  memory  banks.  This  type  of  architecture  is  particularly  suitable  for  handling  large  problems. 
Unfortunately,  the  Motorola  instruction  set  has  not  been  designed  for  the  types  of  operations  which 
are  commonly  encountered  in  statistical  computations.  The  instruction  set  has  been  optimized  for 
the  FFT  [EDN  1988].  As  a  result,  its  operation  set  is  not  quite  as  efficient  as  that  of  DSP32  when 
applied  to  vector  operations.  On  the  positive  side,  the  instruction  set  has  a  wide  variety  of  move  and 
store  operations,  including  register-to-register  and  memory-to-register  operations. 

TI 320C30  DSP  The  TI  320C30  is  a  very  popular  DSP  for  stand-alone  applications.  It  also 
has  a  real-time  operating  system  (SPOX).  It  uses  a  single  precision  floating  point  (24  bit  mantissa 
and  8  bit  exponent)  representation  which  is  a  non-IEEE  format,  requiring  data  conversion.  The  chip 
has a 2K x 32 internal RAM and supports 16M x 32 external memory. The floating point format
was  the  reason  why  this  DSP  was  not  further  considered  for  the  statistics  workstation  application. 

NEC 77230 DSP This DSP supports single precision floating point (24 bit mantissa, 8 bit
exponent). It has a 1K x 32 internal RAM and supports 4K x 32 external program memory and 8K
x 32 external data memory. The limited memory space rules it out as a candidate for statistics
workstation application.

Fujitsu  86232  DSP  This  DSP  also  supports  single  precision  floating  point  (24  bit  mantissa 
and 8 bit exponent). It has a 512 x 32 internal RAM. However, it supports 64K x 32 external program
and 1M x 32 external data memory. It uses IEEE format for floating point operations. It also
handles  fixed-point  and  integer  operations.  This  DSP  can  perform  a  32  bit  MAC  instruction  in  two 
75-nsec  clock  cycles.  The  Fujitsu  86232  is  a  very  recent  design  and  as  such  lacks  the  support  that  is 
available  for  other  DSP  chips  released  earlier.  It  may,  however,  be  a  good  candidate  for  future 
tradeoffs. 

Next-generation  DSPs  It  is  expected  that  a  number  of  new  and  more  capable  DSP  devices 
will  be  released  from  other  suppliers.  Thus,  we  can  expect  announcements  from  NEC,  Analog 
Devices,  and  other  suppliers  trying  to  establish  a  position  in  the  DSP  marketplace.  NEC  has 
indicated  that  a  new  DSP,  MPD77240,  will  be  available  later  this  year.  This  DSP  will  support  a  larger 
external  memory  space  both  for  program  (64K  x  32)  and  data  (16M  x  32).  Thus  it  also  could  be  a 
suitable candidate for future statistics workstation applications.

Some of the new microprocessors, particularly the reduced instruction set computers (RISC
devices), have internal pipelining and exhibit DSP-like capabilities.* The Intel i860 is one of these
new processors. Not only does it have a high throughput, but it also supports some of the DSP
operations, as well as graphical operations. Therefore, the i860 is also a good candidate for future
statistics workstation expansion.

Overall, the next-generation DSP architectures (as well as conventional microprocessor architectures)
borrow heavily from supercomputer architecture. Some of these features include more extensive
pipelining and multiple address and data buses.

D.2  DSP  Boards 

A  number  of  commercial  DSP  boards  are  available.  The  majority  of  these  boards  have  been 
developed  for  audio  applications,  such  as  speech  processing.  As  a  result,  many  of  these  boards  have 
analog  signal  interfaces  and  therefore  are  quite  costly.  However,  for  the  statistics  workstation 
application,  the  analog  interface  normally  will  not  be  required.  Fortunately,  there  are  a  number  of 
DSP boards available without the analog interface. A brief discussion of the DSP32-based boards
follows. 


Communications  Automation  &  Control,  Inc.  offers  several  DSP32  and  DSP32C  boards. 
Their  least  expensive  board  (XNl-BO)  uses  DSP32  with  a  peak  rating  of  8  MFLOPS  and  a  list  price 
of  $795.  An  earlier  version  of  this  board  (DSP32-PC)  was  used  for  our  benchmarking.  The  DSP32C 
boards  start  at  $1695,  unpopulated  (basic  memory  requirements).  256  KBytes  of  zero  wait-state  static 
memory  lists  at  $1200.  The  same  amount  of  one  wait-state  static  memory  costs  half  as  much,  $600. 
Dynamic  memory  can  also  be  used,  but  will  carry  some  speed  penalty.  Since  the  static  memory  is 
quite  costly,  memory  can  be  partitioned  to  use  different  speeds  and  thus  achieve  a  lower  cost  solution. 

Burr-Brown  also  markets  a  PC/AT  compatible  DSP  board.  This  board  (ZPB34)  uses  the  80ns 
DSP32C. It is available with 64 KBytes to 576 KBytes of high-speed RAM. Its price depends on the
memory configuration specified and ranges from $1995 (64 KBytes) to $4995 (576 KBytes). This
board  is  also  suitable  for  use  in  a  statistics  workstation.  However,  some  software  modification  would 
be  required  because  of  a  slightly  different  interface  arrangement. 

Other  DSP  board  vendors  include  Spectrum  Signal  Processing,  Ariel,  and  Vector.  Particularly 
interesting is the recently announced Vector 32C/8500 board. This board also uses the AT&T DSP32C
and is populated with 512 KBytes of static RAM and 8 MBytes of dynamic RAM for $6995. This
board could be well suited for a large-scale statistics workstation operating in a real-time environment
where it is necessary to capture and process large amounts of data.

In the future, additional boards will become available and DSP board prices will decrease.
Part of this decrease will be due to lower memory prices as static RAM production increases.


*  Some  observers  consider  the  DSPs  themselves  to  be  RISC  devices. 
E  Appendix  -  Low-level  Subroutine  Performance


Table  E.1  gives  information  on  the  execution  of  the  low  level  BLAS  and  BSAS  routines 
provided  in  the  statistical  workstation  library.  The  code  for  these  routines  has  been  optimized  to 
provide  the  best  possible  execution  times.  The  information  in  the  table  was  obtained  from  DSP 
source  code  and  by  using  the  DSP  simulator,  which  provided  a  profile  of  the  code. 

E.1  Number  of  Instructions

The  number  of  instructions  for  each  routine  can  be  calculated  from  the  DSP  source  code. 
The  majority  of  the  routines  depend  on  one  or  more  variables  which  determine  the  number  of 
iterations  in  a  loop.  These  loop  variants  are  passed  to  the  routines  as  parameters  indicating  the  size 
of  arrays,  and  are  represented  in  the  table  as  M,  N,  and  P. 

The  instructions  not  included  in  the  loop  are  represented  by  the  constant  in  the  formula  and 
can  usually  be  disregarded  when  the  loop  variant  is  very  large.  These  instructions  represent  the 
subroutine  overhead,  and  are  necessary  to  save  registers,  load  registers,  and  prepare  registers  for  the 
return  from  subroutine. 

Some of the routines, such as HEAP, INDEX, and SROTG, are too complex for an accurate
instruction count*. Their execution depends on data-dependent factors, such as the initial ordering
of the array or the initial values of the parameters. For the HEAP and INDEX routines, big O notation
is used to describe the complexity of the routine. The complexity of SROTG is constant, so an
estimate is given. Other routines, such as ISAMAX, MAXIND, and MININD, are also data dependent,
and their instruction counts fall between the two quantities given.

E.2  Number  of  Nops 

Because  the  DSP  processor  is  pipelined,  "nop"  instructions  are  necessary  to  allow  the 
processor  to  complete  previously  pipelined  instructions  so  that  their  results  can  be  used.  A  nop 
represents  a  null  operation,  and  when  the  processor  encounters  this  instruction  no  action  is  taken. 
Nops, however, can reduce the efficiency of the DSP code if they are inserted in unnecessary
positions in the code. The number of nops is given in the table to show the efficiency of the
optimized  routines. 

The  majority  of  the  routines  in  the  table  have  a  constant  number  of  nops,  indicating  highly 
efficient  code.  Some  routines,  however,  contain  nops  within  the  loops,  which  is  indicated  by  the 
variable N in the formula. For example, HISTINT contains 7 nops in its loop, leaving only 15
effective instructions out of 22. These routines must contain these nops because their algorithms
continuously use previously calculated results. Although this reduces efficiency, it is
unavoidable.
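
The nop counts can be turned into a rough per-iteration efficiency figure. The following is a minimal sketch, not part of the workstation library; the function name is hypothetical, and the coefficients are the per-iteration (N) terms read from Table E.1.

```python
def loop_efficiency(instructions_per_iter, nops_per_iter):
    """Fraction of the instructions in one loop iteration that are not nops."""
    if instructions_per_iter <= 0:
        raise ValueError("instructions_per_iter must be positive")
    if not 0 <= nops_per_iter <= instructions_per_iter:
        raise ValueError("nops_per_iter must lie between 0 and instructions_per_iter")
    return (instructions_per_iter - nops_per_iter) / instructions_per_iter


# (routine, instructions per iteration, nops per iteration) from Table E.1.
for name, instructions, nops in [("HISTINT", 22, 7), ("HISTOG", 19, 5), ("SAXPY", 2, 0)]:
    print(f"{name}: {loop_efficiency(instructions, nops):.2f}")
```

A routine with a constant number of nops has a loop efficiency of 1.0 under this measure, which is the sense in which the text calls such code highly efficient.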


* INDEX and SROTG are not included in Appendix A due to complexity and space limitations.
E.3  Number  of  Wait  States


The  DSP32  automatically  produces  wait  states  when  a  current  memory  address  conflicts  with 
a  memory  access  already  in  progress.  These  wait  states  allow  the  previous  memory  access  to  be 
properly completed before the next address is placed on the bus. Although this allows flexible
memory organization, each wait state degrades throughput by adding 25% to the instruction cycle.

The  memory  in  the  DSP32  is  partitioned  into  two  memory  banks,  an  upper  and  a  lower.  The 
memory  map  used  when  creating  the  table  was  to  place  the  code  and  data  in  the  lower  bank,  while 
placing  the  stack  in  the  upper  bank.  Thus,  because  code  and  data  are  located  in  the  same  memory 
bank,  wait  states  are  introduced  when  a  data  read,  a  data  write  or  an  instruction  fetch  occur 
consecutively. 

Maximum  throughput  can  be  achieved  by  alternating  memory  accesses  between  the  two 
memory  banks,  thus  eliminating  any  wait  states.  All  of  the  wait  states  shown  in  the  table  can  be 
reduced  to  a  constant  or  zero  by  wisely  placing  some  data  in  the  lower  bank,  and  some  in  the  upper 
bank. However, this choice can be difficult, and it is currently made manually. There is also no
guarantee  that  one  particular  memory  format  will  eliminate  wait  states  for  all  routines. 
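
The effect of bank placement can be illustrated with a deliberately simplified model. The sketch below assumes that a wait state occurs whenever an access falls in the same bank as the access immediately before it; this is an illustrative assumption, not a description of the actual DSP32 bus timing, and the function name is hypothetical.

```python
def count_wait_states(bank_sequence):
    """Count accesses that hit the same bank as the immediately preceding access."""
    waits = 0
    previous = None
    for bank in bank_sequence:
        if bank == previous:
            waits += 1
        previous = bank
    return waits


# Code and data in the same bank: every access after the first conflicts.
same_bank = ["lower"] * 6
# Accesses alternate between the two banks: no conflicts in this model.
alternating = ["lower", "upper"] * 3

print(count_wait_states(same_bank))
print(count_wait_states(alternating))
```

In this toy model the alternating sequence incurs no wait states, which mirrors the claim that alternating accesses between the banks gives maximum throughput.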

E.4  Number  of  FLOPS 

In the table, the number of FLOPS (floating-point operations) represents the number of inner
arithmetic calculations required to produce a floating-point result. This term should not be confused
with the performance measure floating-point operations per second. Instructions involving addition,
subtraction, multiplication, or division of floating-point numbers are counted as FLOPS. The DSP
multiply-accumulate instruction contains two inner calculations and is therefore counted as
2 FLOPS.
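
Although the FLOPS column counts operations rather than a rate, dividing it by the execution time gives the sustained rate, since one operation per microsecond equals one MFLOPS. The sketch below uses the SAXPY entries of Table E.1 (2N flops, .75N + 3.375 microseconds); the function names are hypothetical.

```python
def mflops(flop_count, time_us):
    """Sustained rate in MFLOPS: operations per microsecond."""
    if time_us <= 0:
        raise ValueError("time_us must be positive")
    return flop_count / time_us


def saxpy_counts(n):
    """Flop count and DSP32 execution time (microseconds) for SAXPY, from Table E.1."""
    return 2 * n, 0.75 * n + 3.375


for n in (10, 100, 1000):
    flop_count, time_us = saxpy_counts(n)
    print(n, round(mflops(flop_count, time_us), 2))
```

As N grows, the constant overhead term becomes negligible and the rate approaches 2/.75 MFLOPS for this routine.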

Other DSP instructions that use the floating-point DAU (data arithmetic unit) were not counted
but are worth mentioning. Examples are float, int, and ifalt. They were not counted because they
are higher-level instructions that involve no obvious arithmetic. However, since their execution
uses the DAU, they should not be totally overlooked.

E.5  Execution  Time 

The execution time of the routines given in the table is based on the DSP32 operating at
16 MHz, giving an instruction cycle of 250 ns. Each wait state adds one-quarter of an instruction
cycle, so the execution time is the instruction cycle multiplied by the sum of the number of
instructions and one-quarter of the number of wait states. Again, by wisely distributing data
between memory banks, wait states can be eliminated and execution time reduced. Selecting other
DSPs that operate at faster clock rates will also reduce execution time.
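
The timing rule can be checked against the table entries. The following is a minimal sketch of the calculation as described in the text; the function names are hypothetical, and the ABSDEV and ADDVEC counts are taken from Table E.1.

```python
def execution_time_us(instructions, wait_states, cycle_ns=250.0):
    """Execution time in microseconds: each wait state costs a quarter cycle."""
    return (instructions + wait_states / 4.0) * cycle_ns / 1000.0


def absdev_time_us(n):
    # ABSDEV: 5N + 17 instructions and 2 wait states.
    return execution_time_us(5 * n + 17, 2)


def addvec_time_us(n):
    # ADDVEC: 2N + 9 instructions and 4N - 2 wait states.
    return execution_time_us(2 * n + 9, 4 * n - 2)


n = 100
print(absdev_time_us(n), 1.25 * n + 4.375)   # computed value, table formula
print(addvec_time_us(n), 0.75 * n + 2.125)
```

Passing a different `cycle_ns` gives a first estimate for a faster part, with the caveat that the instruction and wait-state counts may differ on another DSP.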

E.6  DSP32C/DSP32  Ratio 


Execution time and performance can be improved by implementing the BLAS and BSAS
routines on the AT&T DSP32C. This DSP operates with an instruction cycle of 80 ns, which yields
an obvious reduction in execution time. In addition, the DSP32C can improve the performance of the
code with its no-overhead loop instruction. By using this instruction, it is possible to shorten most
of the loops in the library by one instruction.

The ratio given in the table is determined assuming N is very large (N >> 100). All routines
execute at least 3.125 times faster on the DSP32C because of the change in the clock rate (an
instruction cycle of 80 ns instead of 250 ns). Routines that currently contain 2 instructions in their
loop can double this factor by using the no-overhead looping construct. The effect of wait states is
not included in this factor; in most cases the wait states remain approximately the same for the
DSP32C.
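
The ratios in the table follow from two factors: the clock gain and the shortening of the loop. The sketch below assumes that the no-overhead loop saves exactly one instruction per iteration, that N is large enough for the constant overhead to vanish, and that wait states are ignored; the function name is hypothetical.

```python
DSP32_CYCLE_NS = 250.0
DSP32C_CYCLE_NS = 80.0


def dsp32c_ratio(loop_instructions, saved_per_iteration=1):
    """Asymptotic DSP32C/DSP32 speed ratio for a loop of the given length."""
    if loop_instructions <= saved_per_iteration:
        raise ValueError("loop must be longer than the number of instructions saved")
    clock_gain = DSP32_CYCLE_NS / DSP32C_CYCLE_NS
    loop_gain = loop_instructions / (loop_instructions - saved_per_iteration)
    return clock_gain * loop_gain


# Loop lengths of 2 to 6 instructions, as found among the table entries.
for length in range(2, 7):
    print(length, round(dsp32c_ratio(length), 2))
```

Setting `saved_per_iteration=0` gives the clock-only factor of 3.125, which corresponds to the entries listed as 3.13.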
Routine     Instructions           Nops             Waits            Flops         Execution time (μsecs)            DSP32C/DSP32 ratio

ABSDEV      5N + 17                1                2                2N            1.25N + 4.375                     3.91
ADDCPY      2N + 14                1                2N - 2           -             .625N + 3.375                     6.25
ADDSCAL     2N + 14                2                2N - 2           2N            .625N + 3.375                     6.25
ADDSCALCPY  2N + 18                2                2N - 2           2N            .625N + 4.375                     6.25
ADDVEC      2N + 9                 2                4N - 2           N             .75N + 2.125                      6.25
CDF         2N + 15                1                2N - 1           N             .625N + 3.6875                    6.25
CENTER      2N + 10                1                2N - 2           N             .625N + 2.375                     6.25
CSUM        2N + 11                1                1                N             .5N + 2.8125                      6.25
CSUMSQ      4N + 15                N + 1            2N + 1           3N            1.125N + 3.8125                   3.13
DIST        4N + 15                N + 1            2N + 1           3N            1.125N + 3.8125                   3.13
EXPSM       3N + 24                2                2N + 1           3N + 1        .875N + 6.0625                    4.69
FILL        2N + 12                1                N                0             .5625N + 3                        6.25
FLOATA      2N + 14                1                2N               -             .625N + 3.5                       6.25
HEAP        O(N log2 N)            -                -                -             -                                 3.13
HISTINT     22N + 30               7N + 1           8N + 3           3N + 2        6N + 7.6875                       3.13
HISTOG      19N + 32               5N + 1           10N + 3          4N + 2        5.375N + 8.1875                   3.13
HORN        3N + 11                N + 1            N                2N            .8125N + 2.75                     3.13
INDEX       O(N log2 N)            -                -                -             -                                 3.13
INTA        2N + 12                1                2N               0             .625N + 3                         6.25
ISAMAX      10N + 5 << 10N + 7     N + 1            2N               N + 1 << 2N   2.625N + 1.25 << 2.625N + 1.75    3.13
LIMIT       8                      1                0                2             2.0                               3.13
MAC         2N + 18                1                4N - 2           2N            .75N + 4.375                      6.25
MATMATT     N[(2P+3)N + 5] + 21    1                N^2(2P+1) + 1    2N^2P         .625N^2(P+1.3) + 1.25N + 5.3125   6.25
MATMULT     M[(2N+5)P + 4] + 30    1                MP(2N+1) + 1     2MNP          .625MP(N+2.1) + M + 7.5625        6.25
MATMULT1    M[(2N+5)P + 4] + 30    1                MP(2N+1) + 1     2MNP          .625MP(N+2.1) + M + 7.5625        6.25

Table E.1: Timing of BLAS/BSAS routines
Routine     Instructions           Nops             Waits            Flops         Execution time (μsecs)            DSP32C/DSP32 ratio

MATMULT2    N[(2M+5)P + 4] + 32    1                NP(2M+1) + 1     2MNP          .625NP(M+2.1) + N + 8.0625        6.25
MATTMAT     P[(2N+5)P + 4] + 21    1                P^2(2N+1) + 1    2NP^2         .625P^2(N+2.1) + P + 5.3125       6.25
MATVEC      M(2N+3) + 13           2                M(2N+1) + 1      2MN           .625M(N+1.3) + 3.3125             6.25
MAXA        3N + 10                1                N + 1            N             .8125N + 2.5625                   4.69
MAXIND      9N + 8 << 10N + 4      N + 1 << 2N - 1  N                N - 1         2.3125N + 2 << 2.5625N + 1        3.13
MEAN        5N + 24                4                3                3N + 3        1.25N + 6.1875                    5.21
MEDIAN      39 + HEAP              7 + HEAP         2 + HEAP         3             9.875 + HEAP                      3.13
MINA        3N + 10                1                N + 1            N             .8125N + 2.5625                   4.69
MININD      9N + 8 << 10N + 4      N + 1 << 2N - 1  N                N + 1         2.3125N + 2 << 2.5625N + 1        3.13
MINMAX      5N + 17                2                2N + 4           2N + 1        1.375N + 4.5                      3.13
MOMENT      6N + 19                N + 1            2N + 5           5N            1.625N + 5.0625                   3.75
PROD        3N + 11                1                1                N             .75N + 2.8125                     3.13
QABS        5                      1                0                0             1.25                              3.13
QABSA       3N + 10                1                3N               0             .9375N + 2.5                      4.69
QMAX        6                      1                0                1             1.5                               3.13
QMIN        6                      1                0                1             1.5                               3.13
RANK        7N + 8                 1                N                0             1.8125N + 2                       3.13
SASUM       4N + 11                1                2N + 1           N             1.125N + 2.8125                   4.17
SAXPY       2N + 14                1                4N - 2           2N            .75N + 3.375                      6.25
SCALCPY     2N + 14                1                2N - 2           N             .625N + 3.375                     6.25
SCOPY       2N + 13                1                2N               0             .625N + 3.25                      6.25
SDOT        2N + 15                1                2N + 1           2N            .625N + 3.8125                    6.25
SIGN        9                      1                0                0             2.25                              3.13

Table E.1: Timing of BLAS/BSAS routines (continued)
Routine     Instructions           Nops             Waits            Flops         Execution time (μsecs)            DSP32C/DSP32 ratio

SIGNA       6N + 14                1                4N               0             1.75N + 3.5                       3.75
SNRM2       2N + 20 + SQRT         4 + SQRT         2N + 1 + SQRT    2N + SQRT     .625N + 5.0625 + SQRT             6.25
SROT        5N + 20                1                3N + 2           6N            1.4375N + 5.125                   3.91
SROTG †     123 << 428             39 << 112        17 << 53         n             31.8125 << 110.3125               3.13
SSCAL       2N + 10                1                2N - 2           N             .625N + 2.375                     6.25
SSQR        2N + 11                1                2N + 1           2N            .625N + 2.8125                    6.25
SSWAP       4N + 13                1                4N               0             1.25N + 3.25                      4.17
SUBVEC      2N + 9                 2                4N - 2           N             .75N + 2.125                      6.25
SUMUNTIL    8N + 14                3N + 2           2                2N            2N + 3.625                        3.13
TRANSP      N(2P+4) + 13           1                2NP              0             .625N(P+1.6) + 3.25               6.25
UPDATPROD   3N + 21                2                4N               2N            N + 5.25                          4.69
UPDATSQR    3N + 14                1                4N               2N            N + 3.5                           4.69
UPDATSUM    2N + 13                1                4N - 2           N             .75N + 3.125                      6.25
VECMAT      P(2N+5) + 21           1                P(2N+1) + 1      2NP           .625P(N+2.1) + 5.3125             6.25
WDOT        4N + 19                N + 1            2N + 1           3N            1.125N + 4.8125                   3.13

Table E.1: Timing of BLAS/BSAS routines (continued)


The SQRT subroutine given in the AT&T library contains:
Number of instructions:  50
Number of nops:          5
Number of waits:         16
Number of FLOPS:         35
Execution time:          13.5 μsecs


† Calculations for SROTG are based on estimates.
F  Appendix  -  DSP  COFF  file  description


DSP assembler. The DSP assembler translates assembly language files into machine-coded
instructions, producing object files. These files include DSP code instructions, relocation information,
global identifiers, and externals. Binary instructions are grouped into sections. Assembler directives
identify these sections in the source file. Each section of the source file is assembled using its own
location counter with a default value of zero. The assembler can also set the location counters
explicitly. After the individual sections have been assembled, they are combined by the link
editor. For example, all of the DSP routines in Appendix A were assembled into object files and
placed in a library for access by a high-level language.

DSP link editor. The link editor creates load modules by combining object files, performing
relocation, resolving external references, and supporting symbol table information for symbolic testing.
By combining relocatable object files, the link editor produces an absolute executable object file. The
link editor's command language permits specification of the memory configuration of the DSP,
combination of sections, placement of sections at specified addresses, and definition and redefinition
of global symbols. This capability permits precise control over the object files and their position in
memory. This is accomplished by binding the object code; binding in the linking process refers to
specifying a starting address in memory.

DSP object files are produced by both the assembler and the link editor. The link editor
accepts relocatable object files as input and produces an output object or executable file that is not
relocatable. Files produced by the assembler are in the common object file format (COFF).
The object file consists of a file header, optional header information, a table of section headers, the
data for each section, relocation information, line numbers, and a symbol table.

COFF files. This file format is used by both the DSP assembler and the link editor. There
are many advantages to using COFF files. The most important one is that COFF files contain
all of the information needed for DSP operation. If DOS files were used, much of this
information would have to be generated locally by a separate program.

Although  the  COFF  files  contain  information  which  is  not  always  needed,  this  information 
can  be  easily  bypassed  because  the  file  and  section  headers  contain  pointers  indicating  where  the 
various  data  elements  are  stored. 

The COFF file structure is defined in the UNIX documentation. It is relatively complex and
highly flexible, which permits its use in a variety of situations. Fortunately, implementing DSP
programs requires only a subset of the available capabilities.

A  short  description  of  COFF  is  provided  below.  Complete  details  are  available  in  the  UNIX 
documentation. 

File header. The file header contains general information, which can be used to
determine whether the proper file format has been specified. For example, the first entry in the file
header is the "Magic Number", which specifies the system and the processor on which the code is
executable. Checking this number helps to determine whether the file is compatible with the
processor used in the system.
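
A magic-number check of this kind can be sketched as follows. The field layout is that of the standard 20-byte System V COFF file header; the little-endian byte order, the function name, and the magic value used here are assumptions for illustration and are not taken from the DSP32 tools.

```python
import struct

# System V COFF file header (20 bytes): magic number, number of sections,
# time stamp, symbol-table file offset, number of symbol entries,
# optional-header size, flags. Little-endian byte order is an assumption.
FILE_HEADER = struct.Struct("<HHlllHH")

# Placeholder for illustration only; NOT the real DSP32 magic number.
HYPOTHETICAL_DSP_MAGIC = 0x0ABC


def read_file_header(data, expected_magic):
    """Unpack a COFF file header and verify its magic number."""
    if len(data) < FILE_HEADER.size:
        raise ValueError("too short to contain a COFF file header")
    magic, nscns, timdat, symptr, nsyms, opthdr, flags = FILE_HEADER.unpack_from(data)
    if magic != expected_magic:
        raise ValueError(f"magic 0x{magic:04x} does not match 0x{expected_magic:04x}")
    return {
        "sections": nscns,
        "timestamp": timdat,
        "symbol_table_offset": symptr,
        "symbol_count": nsyms,
        "optional_header_size": opthdr,
        "flags": flags,
    }


# Build a synthetic header in memory so the example runs without a real file.
sample = FILE_HEADER.pack(HYPOTHETICAL_DSP_MAGIC, 3, 0, 512, 24, 0, 0)
header = read_file_header(sample, HYPOTHETICAL_DSP_MAGIC)
print(header["sections"], header["symbol_count"])
```

The offsets stored in the header are what allow a loader to skip information it does not need, as noted above.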


Sections. A COFF file is divided into sections. Each section has its own header, which contains
a general description of the data. A section is identified by a starting address and a size. The physical
address of a section is an offset that can be used to determine the absolute address. The section is the
smallest program unit of relocation and must be a contiguous block of memory. Sections from input
files are combined to form output sections that contain executable code. Although there may be
holes between output sections, storage is allocated contiguously within each output section. Since
the section order is program dependent, this order is retained in the COFF file and is used by the
data extraction program.

The  specific  section  types  are  indicated  by  the  section  header  flags.  The  key  sections  include 
executable  code  (.text),  initialized  data  (.data)  and  uninitialized  data  (.bss).  Symbolic  names  for  these 
sections are shown in parentheses. In addition to these sections, there are sections for comments,
overlays, libraries, and other purposes. Altogether, COFF allows twelve different section formats.
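
Reading the section headers and classifying them by flag can be sketched as follows. The 40-byte layout and the flag values are those of System V COFF; whether the DSP tools use exactly these values, the little-endian byte order, and the function name are assumptions.

```python
import struct

# System V COFF section header (40 bytes): name, physical address, virtual
# address, size, file offsets of raw data, relocation entries and line
# numbers, two entry counts, and the flags word.
SECTION_HEADER = struct.Struct("<8sllllllHHl")

# Section-type flags as defined for System V COFF (assumed to apply here).
STYP_TEXT, STYP_DATA, STYP_BSS = 0x20, 0x40, 0x80
KINDS = {STYP_TEXT: "code", STYP_DATA: "initialized data", STYP_BSS: "uninitialized data"}


def read_section_headers(data, offset, count):
    """Return name, address, size, file offset and kind for each section header."""
    sections = []
    for index in range(count):
        fields = SECTION_HEADER.unpack_from(data, offset + index * SECTION_HEADER.size)
        name = fields[0].rstrip(b"\0").decode("ascii")
        paddr, size, scnptr, flags = fields[1], fields[3], fields[4], fields[9]
        kind = next((label for bit, label in KINDS.items() if flags & bit), "other")
        sections.append({"name": name, "address": paddr, "size": size,
                         "file_offset": scnptr, "kind": kind})
    return sections


# Two synthetic headers stand in for a real section-header table.
raw = (SECTION_HEADER.pack(b".text", 0, 0, 256, 100, 0, 0, 0, 0, STYP_TEXT)
       + SECTION_HEADER.pack(b".bss", 256, 256, 64, 0, 0, 0, 0, 0, STYP_BSS))
for section in read_section_headers(raw, 0, 2):
    print(section)
```

The list preserves the order of the headers in the file, which matches the requirement that section order be retained for the data extraction program.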

Symbolic labels and table. Although symbolic labels (names of variables) play an important
role during program development and debugging, they are not needed in the operational DSP
program and are stripped off by the load editor when the COFF file is loaded into the DSP.
Therefore, if symbolic access to specific DSP memory locations is desired, the host program
must maintain a DSP symbol table that contains the specific DSP memory addresses. Global and
external symbols are kept in a symbol table in order to resolve references across input files.

The  symbol  table  contains  all  of  the  applicable  symbols  and  their  classes.  This  includes  names 
for  files,  functions,  local  symbols,  statics,  and  global  symbols.  The  type  field  in  the  symbol  table  entry 
specifies the type of the symbol, such as character, integer, floating point, or other. In the COFF file,
16 different symbol types can be identified*.

This table, however, should be limited to those symbolic labels that are needed during
normal operation of the program. The initializing utility program determines whether any duplicate
labels exist and, if so, issues an error message and diagnostic information. Typically, most
of the global labels in the DSP program will be included in the symbol table.
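
A host-side symbol table with duplicate detection might look like the sketch below. The function name, the example labels, and the addresses are hypothetical; the truncation to eight characters reflects the statement that only the first eight characters of a symbol name are significant.

```python
SIGNIFICANT_CHARS = 8  # only the first eight characters of a name are significant


def build_symbol_table(entries):
    """Map label -> DSP address and collect labels that occur more than once."""
    table = {}
    duplicates = []
    for name, address in entries:
        key = name[:SIGNIFICANT_CHARS]
        if key in table:
            duplicates.append(key)
        else:
            table[key] = address
    return table, duplicates


# Hypothetical global labels and addresses, for illustration only.
entries = [
    ("xvector", 0x0100),
    ("yvector", 0x0200),
    ("result_sum", 0x0300),
    ("result_sumsq", 0x0304),  # same first eight characters as result_sum
]
table, duplicates = build_symbol_table(entries)
print(table)
print("duplicate labels:", duplicates)
```

Reporting the duplicates rather than silently overwriting an entry corresponds to the diagnostic behavior described for the initializing utility.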


* If the symbol name is eight characters or fewer, the full symbol name is stored in the symbol table.
Otherwise, the DSP32 compiler considers only the first eight characters to be significant.




G  Appendix  -  Statistical  Software  Survey 


This  appendix  reviews  some  of  the  important  features  of  existing  software  packages  and 
support  tools  that  could  be  incorporated  into  the  statistics  workstation. 

G.1  Statistical  packages

S-Language. The S language is very flexible, has a large user base, and can handle a large
variety of statistical computations [Becker 1988]. The developers of this language call it a
programming environment for data analysis and graphics. This high-level interactive language is
integrated in a UNIX environment and has many similarities to the C language. Data management
support allows easy organization, storage, and recall of data. The S language library also contains an
extensive collection of well-known statistical data bases.

The  basic  issues  to  address  when  considering  the  use  of  the  S  language  are: 

o  Because the high-level commands are interpreted, the computation rate is significantly
   reduced. This loss in computation speed is partially compensated by using compiled
   procedures and functions.

o  Compiled functions and procedures must be compiled and stored in the library module
   before they can be used. The added functions and procedures can be programmed in either
   FORTRAN or C.

One  disadvantage  of  using  the  S  language  is  the  need  to  learn  the  command  language 
structure.  This  structure  is  complex  and  some  of  the  commands  may  appear  awkward  to 
inexperienced  users. 

Statgraphics.  Statgraphics  is  another  integrated  system  for  interactive  data  analysis.  It  also 
supports  data  management  and  provides  a  flexible  graphics  display  capability.  Selection  of  variables, 
data  transformation,  and  selection  of  options  is  done  in  screen  editing  format,  instead  of  user  typed 
commands.  Statgraphics  is  unique  in  that  it  is  based  on  APL,  a  very  compact  and  terse  language 
suitable  for  vector  and  matrix  operations.  Therefore,  knowledge  of  APL  is  very  helpful  when  using 
and  extending  the  capabilities  of  this  program. 

TIMESLAB.  TIMESLAB  [Newton  1988]  has  been  developed  as  a  time-series  teaching 
program. A wide range of commands is available, and these commands can be used to develop more
complex macro commands. This particular package, however, is directed at a more specific audience.

Simulation  Languages.  Simulation  is  becoming  more  important  in  statistical  analysis. 
Although  the  majority  of  simulation  languages  are  general  purpose,  they  can  be  applied  to  a  variety 
of  statistical  analyses.  The  most  often  used  simulation  languages  include  GPSS  (general  purpose 
simulation  system)  and  SIMSCRIPT.  GPSS/PC  is  highly  interactive  and  supports  all  edit,  compile, 
link,  run,  and  debug  operations  in  an  integrated  environment.  Typical  applications  include  business, 
warehousing,  manufacturing,  distribution  and  other  similar  discrete  systems. 
G.2  Spreadsheets


The  majority  of  the  spreadsheets  include  some  form  of  statistical  computation  capability.  In 
addition, the macro language supplied gives the user some programming flexibility. However, a major
disadvantage of using spreadsheets is the slower speed due to the interpretation of most commands.
The  popularity  of  the  spreadsheet  format  is  evidenced  by  the  number  of  statistics  packages  that 
include  some  form  of  spreadsheet  data  input  (such  as  Minitab). 

Conventional Spreadsheets. Representative examples of commonly available spreadsheets are
Borland-Quattro™ and Lotus 123™. They both have limited statistical commands; multivariate linear
regression is one of the most powerful of these. These spreadsheets, however, benefit from wide
usage in applications ranging from business to scientific computations. The main advantages and
liabilities of spreadsheet programs are:

o  The spreadsheets have good data handling capabilities. The data is always visible to the user.
   The spreadsheet commands are highly interactive and menu-driven.
o  Spreadsheets have good graphing capabilities, although due to their origins in the business
   environment, they tend to emphasize pie charts and bar graphs.
o  They feature symbolic and algebraic programming capabilities. However, the spreadsheet
   commands and formulas are interpreted through a macro processor. The interpretation
   introduces overhead and makes any lengthy computation very slow.
o  Due to the wide use of the packages, and in particular of Lotus 123™, add-on programs exist
   which are designed to speed up or increase the flexibility of certain aspects of the package.

Stochastic  Spreadsheet.  An  interesting  offshoot  of  the  spreadsheet  is  the  stochastic 
spreadsheet,  which  is  designed  to  do  Monte  Carlo  simulations  [Rubinstein  1986]  on  what  are 
essentially  spreadsheet  formulas  [Coe  1989].  This  enables  the  users  to  do  probabilistic  sensitivity 
analysis  instead  of  the  sensitivity  tables  currently  employed  for  one  or  two  values  [Quattro  1989].  As 
with  most  Monte  Carlo  experiments,  the  simulation  times  can  become  quite  lengthy  for  complicated 
expressions.  In  a  similar  fashion,  special  purpose  spreadsheets  for  data  analyses  can  be  developed. 
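
The idea of running a Monte Carlo simulation over a spreadsheet formula can be sketched as follows. The formula, the input distributions, and all names are hypothetical and chosen only to illustrate probabilistic sensitivity analysis; they are not taken from any of the cited packages.

```python
import random
import statistics


def simulate_cell(formula, input_samplers, trials=10_000, seed=0):
    """Evaluate a cell formula repeatedly with randomly drawn input cells."""
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        inputs = {name: sampler(rng) for name, sampler in input_samplers.items()}
        results.append(formula(**inputs))
    return results


def profit(units, price, unit_cost):
    # Stand-in for a spreadsheet formula such as =A1*(B1-C1).
    return units * (price - unit_cost)


# Uncertain input cells are described by distributions instead of single values.
samplers = {
    "units": lambda rng: rng.uniform(900, 1100),
    "price": lambda rng: 10.0,
    "unit_cost": lambda rng: rng.uniform(5.0, 7.0),
}

samples = simulate_cell(profit, samplers, trials=5000, seed=1)
deciles = statistics.quantiles(samples, n=10)
print(f"mean profit: {statistics.mean(samples):.0f}")
print(f"10th to 90th percentile: {deciles[0]:.0f} to {deciles[-1]:.0f}")
```

The inner loop evaluates the formula once per trial, which is why simulation times grow quickly for complicated expressions.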

G.3  Algebra  and  Matrix  Packages 

There  are  several  packages  available  that  concentrate  on  the  mathematical  aspects  of  problem 
solving. They can be considered as auxiliary tools for the statistician. These include MATLAB™
(primarily for matrix computations [Coleman 1988]), MathCad™, Mathematica™, and Macsyma™.
These  packages  are  all  highly  interactive,  in  that  most  equations  can  be  positioned  on  a  scratch-pad 
or  worksheet.  They  also  feature  symbolic  computations  that  are  not  restricted  to  cell  references  as 
in  the  spreadsheet  packages.  Iteration  and  step  size  control  is  provided  for  doing  integrations. 

The packages offer simple and flexible programming. They are often used to prototype an
algorithm before the high-level language version is written. Several packages have sophisticated
graphics (such as contour plots and 3D views in Mathematica), which make them suited for
exploratory data analysis. The disadvantage of these programs is that much of the computation is
interpreted, which leads to slower execution.
ACRONYMS


80x86     any of the Intel 8086, 80286, 386, or 486 processors
μs        10^-6 second
1D        one dimensional
2D        two dimensional
AI        artificial intelligence
ALU       arithmetic logic unit
ANOVA     analysis of variance
AR        auto regressive
ARMA      auto regressive moving average
BIOS      basic input/output services
BLAS      basic linear algebra subroutines
BSAS      basic statistical analysis subroutines
CAD       computer-aided design
CAU       control arithmetic unit
CGA       color graphics adaptor
CI        computation-intensive
CISC      complex instruction set computer
CMOS      complementary metal-oxide semiconductor
COFF      common object file format
CPU       central processing unit
DAU       data arithmetic unit
DMA       direct memory access
DOS       disk operating system
DRAM      dynamic RAM
DSP       digital signal processor
EGA       enhanced graphics adaptor
FFT       fast Fourier transform
FIR       finite impulse response
FLOPS     floating point operations per second
FPU       floating point unit
I/O       input/output
IEEE      Institute of Electrical and Electronics Engineers
ifalt     if accumulator less than
iid       independent identically distributed
IIR       infinite impulse response
MA        moving average
MAC       multiply accumulate
MB        million bytes
MC        Monte Carlo
MFLOPS    million floating point operations per second
MHz       million cycles per second
ms        10^-3 second
MSDOS     Microsoft DOS
NMOS      n-type metal-oxide semiconductor
NN        neural networks
nop       no operation
ns        nanosecond
PC        personal computer
PP        projection pursuit
RAM       random access memory
RISC      reduced instruction set computer
rms       root mean square
ROM       read only memory
SAXPY     Single-precision A * X Plus Y
SDL       software description language
SOR       simultaneous over relaxation
SPC       statistical process control
SQC       statistical quality control
SRAM      static RAM
SVD       singular value decomposition
SW        statistical workstation
VGA       video graphics array
VHDL      VHSIC hardware description language
YACC      yet another compiler compiler

SYMBOLS


x̄         mean
x         vector
%         modulus (C language)
&         bitwise AND (C language)
&&        logical AND (C language)
++        increment operator
α         scalar
ε         iid noise
λ         failure rate
μ         repair rate
σ         rms deviation
a         scalar element
A         matrix
A^T       transpose of matrix
b         scalar element
B         matrix
c         scalar element
C         matrix
f(x)      continuous function
i         index
I         identity matrix
j         index
k         index
M         number of elements
mod       modulus
N         number of elements
P         number of elements
P(t)      probability
sup       supremum
t         time
w         scalar result
x_i       vector element
y_i       vector element
z_i       vector element
←         replace
R         set of real numbers
Π         product
Σ         summation
∈         member of
          dot product
          exchange
|x|       absolute value
||x||     norm of vector


